No one individual or organization has the incentive to produce free and large-scale databases that mostly benefit other scientists. Such databases and platforms are classic public goods, and one of the central roles for government funding of science should be to provide them. And there are considerable successes to point to (e.g., the databases hosted at NCBI).
Still, government often fails to provide enough public goods of the right type–perhaps because of other political or funding priorities, or perhaps just because no one has ever raised a particular issue. A key belief of the Good Science Project is that government funders should dedicate a higher percentage of funding to infrastructure–including data repositories, platform technologies, open source software, and more.
To take one example: there is no public, freely downloadable, comprehensive database of all known chemical reactions. Such a database could hugely benefit chemical R&D, for example by allowing the power of AI and machine learning to be deployed on many important problems. Pierre Baldi, a professor at the University of California, Irvine, recently published a paper on that point: “Call for a Public Open Database of All Chemical Reactions.” [Note: This post is a lightly re-written version of that article, with Baldi’s permission.]
Baldi’s argument:
In spite of its scientific role linking physics to biology, chemistry has remained in a relatively backward state of informatics development.
Sure, there are large, collaborative databases and projects in biology (e.g., the Human Genome Project) and physics (e.g., the Large Hadron Collider).
But in stark contrast, there are almost no large collaborative efforts and databases in chemical research. Yes, some databases do exist, but they require a license, and the underlying data can only be accessed through a search engine that limits you to one search at a time–thus eliminating the possibility of large-scale machine learning.
A public and comprehensive database of chemical reactions would help the efforts of chemists across the world. Moreover, it would enable expert AI algorithms that might develop the capability to understand and predict chemical reactions. Such systems would in turn be widely applicable to many areas ranging from drug design and discovery, to studies of the origin of life, to materials science, and to environmental sciences and atmospheric chemistry.
We would be able to identify paths towards new products, new processes, and new startup companies. Ultimately, this would be of untold benefit to the economy and humanity.
How do we create such a chemical database?
The ideal solution would involve an international consortium, but ultimately government agencies have to provide ongoing support.
As Baldi says, “In the end, the most expeditious solution may be for NSF and NIH, the two main scientific governmental organizations, to lead the way together in assembling the proper consortium to create and maintain the database.”
NSF and NIH should move forward with Baldi’s proposal, which is precisely the kind of thing that government can do best.