December 14, 2025

AI and Pediatric Cancer Research

The Trump Administration recently issued an executive order titled “Unlocking Cures for Pediatric Cancer with Artificial Intelligence,” with an accompanying announcement from the Department of Health and Human Services (HHS) that it would double spending on the Childhood Cancer Data Initiative (from $50 million to $100 million per year).

What does it actually entail?

The EO aims to build upon a 2019 initiative about cancer data:

In 2019, my Administration created the Childhood Cancer Data Initiative (CCDI), a Federal investment in childhood cancer research of $50 million in funding every year for 10 years to address the critical need to collect, generate, and analyze childhood cancer data. The CCDI is building a foundational data infrastructure, aggregating and generating new data, and using this data to make new discoveries.

AI can be used to build upon this data initiative to produce meaningful solutions to pediatric, adolescent, and young adulthood cancer.

The new EO’s goal is to fund new research projects focused on:

  • improving data infrastructure so that AI is better able to analyze data across multiple sources and help steer people into clinical trials;
  • using AI tools to “radically improve predictive modeling” of issues like how patients respond to treatment; and,
  • improving clinical trials across the board (design, access, recruitment, administration, outcomes, etc.).

AI and Data

When it comes to AI and science, the government simply can’t compete with the likes of OpenAI or Anthropic in developing new versions of large language models and the like. The GS salary scale can’t match private-sector salaries that can be 10x higher, plus equity that can be worth many times more than the salary itself.

But a uniquely useful role that government can play is creating public goods like massive, well-annotated, and comprehensive databases that no one else has the incentive to create, let alone make freely available for the rest of the world to build upon.

Nearly everyone who subscribes to this newsletter will have heard of AlphaFold, the innovative AI tool that can predict protein folding. AlphaFold wouldn’t exist without a huge source of data on protein folding: the Protein Data Bank, which has existed since the 1970s and has been supported by the National Science Foundation, the NIH, and even the Department of Energy (!). And it’s not a trivial expenditure: earlier this year, these federal agencies announced funding for the Protein Data Bank reaching nearly $50 million over the next few years.

Ideally, the Childhood Cancer Data Initiative (CCDI) could play the same role in the cancer sector: creating a gigantic database that anyone else (including AI companies) can use to make new discoveries.

All of that said, Google Scholar lists only 253 articles (or webpages) that cite CCDI. Judging from the first 100 results, almost all of them merely mention or describe CCDI rather than actually using the underlying data. The CCDI website itself lists only 32 studies over the past 6 years (at least 13 of which appear to still be in the planning stage and have not yet resulted in any publications).

I reached out to more than a dozen scholars who have used the CCDI data in some fashion. I also reached out to a researcher at DeepMind (which is arguably the world’s leader in producing AI biology models).

I heard the following reactions:

While all attention to childhood cancer is good, this particular funding will be productive only if funds for data aggregation and analysis are paired with funds for detailed clinical and molecular diagnostic annotation from medical records and other data sources. Without this, any computational inference will be futile.

Second reaction:

I agree 100%. The omic data is of very limited value without clinical metadata.

Third response:

As an additional note, this type of data should include treatment and outcomes data. The recipients of your email have all been involved in some version of this type of clinical metadata and omic data collection; however, there are barriers that exist to being able to collect and broadly share amongst scientists.

Fourth response:

In addition to the recommendation about adding more metadata to the final outcome, I think adding more intermediate molecular phenotypes would be helpful for training AI models. Looking at https://ccdi.cancer.gov/explore?tab=0, they have sequencing data for 14k participants, pathology imaging for 4k participants, and gene expression only for 361 participants. For training AI models to better understand the molecular underpinnings of cancer, it would be great to complement whole genome sequencing with gene expression measurements, maybe even using spatial resolution.

A takeaway here: If NIH wants to upscale a pediatric cancer database so that it is most useful to AI researchers, it needs to work closely with top medical and AI researchers who will have the best on-the-ground insights as to what would make the database actually useful.

Interoperability

There’s another huge issue: The Executive Order asks the HHS Secretary to make sure that “AI innovation is appropriately integrated into current work on interoperability to maximize the potential for electronic health record and claims data to inform private sector and academic research and clinical trial design.”

I’ve long thought that we need basic interoperability between clinical trials and electronic health records, particularly since 2013, when Mike Lauer wrote about an impressive clinical trial in Scandinavia (the “TASTE” trial) that cost only about 1% of what comparable trials cost, because it was built into the health records system as part of ordinary clinical care.

But this isn’t a trivial issue. For some 20 years, the Office of the National Coordinator for Health Information Technology (ONC) has been trying to create and incentivize interoperability of health care records. ONC has made some progress, but interoperability is a challenging issue even for electronic health records and claims data alone, never mind trying to integrate those with genetic and imaging data, and then make all of the above available to AI tools.

One can imagine doing all of this at a fairly small scale within a contained environment. But as one of the nation’s top clinical trialists wrote in a recent policy brief:

A major problem is that our current model in healthcare doesn’t allow us to generate reusable data at the point of care. This is even more frustrating because providers face a high burden of documentation, and patients report repetitive questions from providers and questionnaires.

To expand a bit: while large amounts of data are generated at the point of care, these data lack the quality, standardization, and interoperability to enable downstream functions such as clinical trials, quality improvement, and other ways of generating more knowledge about how to improve outcomes.

The real potential of AI would be better realized if HHS and its operating divisions pushed for the reforms suggested in that piece — i.e., incentivizing providers to use health care records that standardize the right data elements, creating a regulatory framework allowing interoperability for research purposes, and more.

**

In short, the goal of deploying AI on pediatric cancer data may be achievable in the long run, but the effort needs sustained policy design, technical standardization, and coordination/funding across multiple corners of HHS and the broader biomedical ecosystem.