Machine learning (ML) shows promise for accelerating scientific progress.[Jablonka et al. (2020); Butler et al. (2018); Yano et al. (2022); Yao et al. (2022); De Luna et al. (2017); Wang et al. (2023)] Recent work has demonstrated, for example, the ability of ML models to make predictions for multiscale systems,[Charalambous et al. (2024); W. Yang et al. (2020); Deringer et al. (2021)] to perform experiments by interacting with laboratory equipment,[Boiko et al. (2023); Coley et al. (2019)] to autonomously collect data from the scientific literature,[Schilling-Wilhelmi et al. (2025); W. Zhang et al. (2024); Dagdelen et al. (2024)] and to make predictions with high accuracy.[Jablonka et al. (2024); Jablonka et al. (2023); Jung, Jung, and Cole (2024); Rupp et al. (2012); Keith et al. (2021); J. Wu et al. (2024)]
However, the diversity and scale of chemical data create a unique challenge for applying ML to the chemical sciences. This diversity manifests across temporal, spatial, and representational dimensions. Temporally, chemical processes span femtosecond-scale spectroscopic events to year-long stability studies of pharmaceuticals or batteries, demanding data sampled at resolutions tailored to each time regime. Spatially, systems range from the atomic to the industrial scale, requiring models that bridge molecular behavior and macroscopic properties. Representationally, even a single observation (e.g., a ¹³C NMR spectrum) can be encoded in chemically equivalent formats: a string,[Alberts et al. (2024)] a vector,[Mirza and Jablonka (2024)] or an image.[Alberts et al. (2024)] Yet such representations are not computationally equivalent and have been empirically shown to produce different model outputs.[Atz et al. (2024); Alampara et al. (2025); J.-N. Wu et al. (2024); Skinnider (2024)]
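To make the notion of chemically equivalent but computationally distinct representations concrete, the sketch below (a minimal illustration assuming RDKit is available, and using a molecular structure rather than a spectrum) derives a string, a vector, and an image representation from the same molecule; the caffeine SMILES and all parameter values are purely illustrative.

```python
# Minimal sketch: one chemical entity, three computationally distinct encodings.
# Assumes RDKit is installed; the molecule and parameters are illustrative.
from rdkit import Chem
from rdkit.Chem import AllChem, Draw

smiles = "CN1C=NC2=C1C(=O)N(C)C(=O)N2C"  # caffeine, defined once as a SMILES string
mol = Chem.MolFromSmiles(smiles)

# String representation: a token sequence a language model could consume.
string_repr = Chem.MolToInchi(mol)

# Vector representation: a fixed-length bit vector for classical ML models.
vector_repr = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048)

# Image representation: a 2D depiction (PIL image) for vision models.
image_repr = Draw.MolToImage(mol, size=(300, 300))

print(string_repr)                 # e.g., "InChI=1S/C8H10N4O2/..."
print(vector_repr.GetNumOnBits())  # number of set bits in the fingerprint
image_repr.save("caffeine.png")
```

All three encodings describe the same chemical entity, yet they expose it to a model in very different ways, which is one source of the representation-dependent behavior cited above.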
Additionally, ML for chemistry is challenged by what one can term “hidden variables”. These can be thought of as the parameters of an experiment that remain largely unaccounted for (e.g., because their importance is unknown or because they are difficult to control) but could have a significant impact on experimental outcomes. One example is seasonal variation in ambient laboratory conditions, which is typically not controlled for and, if recorded at all, only communicated in private accounts.[Nega et al. (2021)] Moreover, chemistry is believed to rely on a large amount of tacit knowledge, i.e., knowledge that cannot be readily verbalized.[Taber (2014); Polanyi (2009)] Tacit chemical knowledge includes the subtle nuances of experimental procedures, troubleshooting techniques, and the ability to anticipate potential problems based on experience.
These factors, namely the diversity, scale, and tacit nature of chemical knowledge, clearly indicate that the full complexity of chemistry cannot be captured using standard approaches with bespoke representations based on structured data.[Jablonka, Patiny, and Smit (2022)] Fully addressing the challenges posed by chemistry requires the development of ML systems that can handle diverse, “fuzzy” data instances and that possess transferable capabilities allowing them to learn from small amounts of data.
“Foundation model” has become a popular term for large pretrained models that serve as a basis for various downstream tasks. The first comprehensive description of such models was provided by Bommasani et al. (2021), who also coined the term. In the chemical literature, however, the term carries different connotations: in many cases it denotes a domain-specific, state-of-the-art (SOTA) model limited to one input modality (e.g., amino acid sequences or crystal structures). Here, we distinguish between what we term general-purpose models (GPMs), such as large language models (LLMs),[D. Zhang et al. (2024); Guo et al. (2025); OpenAI et al. (2023); Anthropic (n.d.); Brown et al. (2020)] and domain-specific models with SOTA performance on a subset of tasks, such as machine-learning interatomic potentials.[Batatia et al. (2023); Chen and Ong (2022); Unke et al. (2021)] We adopt the term GPM to avoid the semantic overlap caused by “foundation model” and to signal the breadth of applicability that we seek to emphasize.
A GPM is a model that has been pre-trained on a broad, heterogeneous corpus spanning multiple data modalities (text, images, graphs) or representations (e.g., common names, 3D coordinates, molecular images). It can be applied to a wide spectrum of downstream tasks that differ in objective (classification, regression, generation, reasoning), input format, and domain—ranging from natural-language processing to chemistry and vision—with little or no task-specific fine-tuning. A GPM supports zero-shot, few-shot, or transfer learning and can serve as the core component of autonomous agents.
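As a minimal illustration of the zero- and few-shot usage pattern described above, the following sketch prompts a general-purpose LLM with a handful of labeled examples for a toy solubility-classification task; it assumes the OpenAI Python client, and the model identifier, prompt, and labels are placeholders rather than results from any of the works cited here.

```python
# Minimal few-shot prompting sketch (illustrative only); assumes the `openai`
# Python package and an OPENAI_API_KEY environment variable are available.
from openai import OpenAI

client = OpenAI()

# A toy few-shot prompt: two labeled examples followed by a query molecule.
few_shot_prompt = (
    "Classify each molecule as 'soluble' or 'insoluble' in water.\n"
    "SMILES: CCO -> soluble\n"            # ethanol
    "SMILES: C1=CC=CC=C1 -> insoluble\n"  # benzene
    "SMILES: CC(=O)O -> "                 # acetic acid, to be classified
)

response = client.chat.completions.create(
    model="gpt-4o",  # placeholder model identifier
    messages=[{"role": "user", "content": few_shot_prompt}],
)
print(response.choices[0].message.content)
```

The same pattern extends to zero-shot use (no labeled examples in the prompt) or to embedding the GPM as the reasoning core of an autonomous agent.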
Alampara, Nawaf, Mara Schilling-Wilhelmi, Martiño Ríos-García, Indrajeet Mandal, Pranav Khetarpal, Hargun Singh Grover, N. M. Anoop Krishnan, and Kevin Maik Jablonka. 2025.
“Probing the Limitations of Multimodal Language Models for Chemistry and Materials Research.” Nature Computational Science.
https://doi.org/10.1038/s43588-025-00836-3.
Alberts, Marvin, Oliver Schilter, Federico Zipoli, Nina Hartrampf, and Teodoro Laino. 2024.
“Unraveling Molecular Structure: A Multimodal Spectroscopic Dataset for Chemistry.” arXiv.
https://doi.org/10.48550/arXiv.2407.17492.
Anthropic. n.d.
“System Card: Claude Opus 4 & Claude Sonnet 4.” Accessed October 9, 2025.
https://www-cdn.anthropic.com/6be99a52cb68eb70eb9572b4cafad13df32ed995.pdf.
Atz, Kenneth, Leandro Cotos, Clemens Isert, Maria Håkansson, Dorota Focht, Mattis Hilleke, David F Nippa, et al. 2024.
“Prospective de novo drug design with deep interactome learning.” Nature Communications 15 (1): 3408.
https://doi.org/10.1038/s41467-024-47613-w.
Batatia, Ilyes, Philipp Benner, Yuan Chiang, Alin M Elena, Dávid P Kovács, Janosh Riebesell, Xavier R Advincula, et al. 2023.
“A foundation model for atomistic materials chemistry.” arXiv.
https://doi.org/10.48550/arXiv.2401.00096.
Boiko, Daniil A, Robert MacKnight, Ben Kline, and Gabe Gomes. 2023.
“Autonomous chemical research with large language models.” Nature 624 (7992): 570–78.
https://doi.org/10.1038/s41586-023-06792-0.
Bommasani, Rishi, Drew A Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S Bernstein, et al. 2021.
“On the opportunities and risks of foundation models.” arXiv.
https://doi.org/10.48550/arXiv.2108.07258.
Brown, Tom, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, et al. 2020.
“Language Models Are Few-shot Learners.” Advances in Neural Information Processing Systems 33: 1877–1901.
https://doi.org/10.48550/arXiv.2005.14165.
Butler, Keith T., Daniel W. Davies, Hugh Cartwright, Olexandr Isayev, and Aron Walsh. 2018.
“Machine Learning for Molecular and Materials Science.” Nature 559 (7715): 547–55.
https://doi.org/10.1038/s41586-018-0337-2.
Charalambous, Charithea, Elias Moubarak, Johannes Schilling, Eva Sanchez Fernandez, Jin-Yu Wang, Laura Herraiz, Fergus Mcilwaine, et al. 2024.
“A holistic platform for accelerating sorbent-based carbon capture.” Nature 632 (8023): 89–94.
https://doi.org/10.1038/s41586-024-07683-8.
Chen, Chi, and Shyue Ping Ong. 2022.
“A Universal Graph Deep Learning Interatomic Potential for the Periodic Table.” Nature Computational Science 2 (11): 718–28.
https://doi.org/10.1038/s43588-022-00349-3.
Coley, Connor W, Dale A Thomas III, Justin AM Lummiss, Jonathan N Jaworski, Christopher P Breen, Victor Schultz, Travis Hart, et al. 2019.
“A robotic platform for flow synthesis of organic compounds informed by AI planning.” Science 365 (6453): eaax1566.
https://doi.org/10.1126/science.aax1566.
Dagdelen, John, Alexander Dunn, Sanghoon Lee, Nicholas Walker, Andrew S Rosen, Gerbrand Ceder, Kristin A Persson, and Anubhav Jain. 2024.
“Structured Information Extraction from Scientific Text with Large Language Models.” Nature Communications 15 (1): 1418.
https://doi.org/10.1038/s41467-024-45563-x.
De Luna, Phil, Jennifer Wei, Yoshua Bengio, Alán Aspuru-Guzik, and Edward Sargent. 2017.
“Use Machine Learning to Find Energy Materials.” Nature 552 (7683): 23–27.
https://doi.org/10.1038/d41586-017-07820-6.
Deringer, Volker L., Noam Bernstein, Gábor Csányi, Chiheb Ben Mahmoud, Michele Ceriotti, Mark Wilson, David A. Drabold, and Stephen R. Elliott. 2021.
“Origins of Structural and Electronic Transitions in Disordered Silicon.” Nature 589 (7840): 59–64.
https://doi.org/10.1038/s41586-020-03072-z.
Google DeepMind. n.d.
“Gemini Diffusion: a new kind of text model.” Accessed October 9, 2025.
https://deepmind.google/models/gemini-diffusion/.
Grattafiori, Aaron, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, et al. 2024.
“The Llama 3 Herd of Models.” arXiv.
https://doi.org/10.48550/arXiv.2407.21783.
Gu, Albert, and Tri Dao. 2023.
“Mamba: Linear-Time Sequence Modeling with Selective State Spaces.” arXiv.
https://doi.org/10.48550/arXiv.2312.00752.
Guo, Daya, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, et al. 2025.
“DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning.” arXiv.
https://doi.org/10.48550/arXiv.2501.12948.
Jablonka, Kevin Maik, Charithea Charalambous, Eva Sanchez Fernandez, Georg Wiechers, Juliana Monteiro, Peter Moser, Berend Smit, and Susana Garcia. 2023.
“Machine learning for industrial processes: Forecasting amine emissions from a carbon capture plant.” Science Advances 9 (1): eadc9576.
https://doi.org/10.1126/sciadv.adc9576.
Jablonka, Kevin Maik, Daniele Ongari, Seyed Mohamad Moosavi, and Berend Smit. 2020.
“Big-data science in porous materials: materials genomics and machine learning.” Chemical Reviews 120 (16): 8066–8129.
https://doi.org/10.1021/acs.chemrev.0c00004.
Jablonka, Kevin Maik, Luc Patiny, and Berend Smit. 2022.
“Making the collective knowledge of chemistry open and machine actionable.” Nature Chemistry 14 (4): 365–76.
https://doi.org/10.1038/s41557-022-00910-7.
Jablonka, Kevin Maik, Philippe Schwaller, Andres Ortega-Guerrero, and Berend Smit. 2024.
“Leveraging large language models for predictive chemistry.” Nature Machine Intelligence 6 (2): 161–69.
https://doi.org/10.1038/s42256-023-00788-1.
Jumper, John, Richard Evans, Alexander Pritzel, Tim Green, Michael Figurnov, Olaf Ronneberger, Kathryn Tunyasuvunakool, et al. 2021.
“Highly Accurate Protein Structure Prediction with AlphaFold.” Nature 596 (7873): 583–89.
https://doi.org/10.1038/s41586-021-03819-2.
Jung, Son Gyo, Guwon Jung, and Jacqueline M Cole. 2024.
“Automatic Prediction of Molecular Properties Using Substructure Vector Embeddings within a Feature Selection Workflow.” Journal of Chemical Information and Modeling 65 (1): 133–52.
https://doi.org/10.1021/acs.jcim.4c01862.
Keith, John A., Valentin Vassilev-Galindo, Bingqing Cheng, Stefan Chmiela, Michael Gastegger, Klaus-Robert Müller, and Alexandre Tkatchenko. 2021.
“Combining Machine Learning and Computational Chemistry for Predictive Insights into Chemical Systems.” Chemical Reviews 121 (16): 9816–72.
https://doi.org/10.1021/acs.chemrev.1c00107.
Inception Labs, Samar Khanna, Siddhant Kharbanda, Shufan Li, Harshit Varma, Eric Wang, Sawyer Birnbaum, et al. 2025.
“Mercury: Ultra-Fast Language Models Based on Diffusion.” arXiv.
https://doi.org/10.48550/arXiv.2506.17298.
Lin, Zeming, Halil Akin, Roshan Rao, Brian Hie, Zhongkai Zhu, Wenting Lu, Nikita Smetanin, et al. 2023.
“Evolutionary-Scale Prediction of Atomic-Level Protein Structure with a Language Model.” Science 379 (6637): 1123–30.
https://doi.org/10.1126/science.ade2574.
Mirza, A., and K. M. Jablonka. 2024.
“Elucidating Structures from Spectra Using Multimodal Embeddings and Discrete Optimization.” ChemRxiv.
https://doi.org/10.26434/chemrxiv-2024-f3b18-v2.
Nega, Philip W., Zhi Li, Victor Ghosh, Janak Thapa, Shijing Sun, Noor Titan Putri Hartono, Mansoor Ani Najeeb Nellikkal, et al. 2021.
“Using Automated Serendipity to Discover How Trace Water Promotes and Inhibits Lead Halide Perovskite Crystal Formation.” Applied Physics Letters 119 (4).
https://doi.org/10.1063/5.0059767.
OpenAI, Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, et al. 2023.
“GPT-4 Technical Report.” arXiv.
https://doi.org/10.48550/arXiv.2303.08774.
Polanyi, Michael. 2009. The Tacit Dimension. Facsimile reprint. Chicago: University of Chicago Press.
Rupp, Matthias, Alexandre Tkatchenko, Klaus-Robert Müller, and O. Anatole von Lilienfeld. 2012.
“Fast and Accurate Modeling of Molecular Atomization Energies with Machine Learning.” Physical Review Letters 108 (5).
https://doi.org/10.1103/physrevlett.108.058301.
Schilling-Wilhelmi, Mara, Martiño Ríos-García, Sherjeel Shabih, María Victoria Gil, Santiago Miret, Christoph T Koch, José A Márquez, and Kevin Maik Jablonka. 2025.
“From text to insight: large language models for chemical data extraction.” Chemical Society Reviews 54 (3): 1125–50.
https://doi.org/10.1039/d4cs00913d.
Schwaller, Philippe, Teodoro Laino, Théophile Gaudin, Peter Bolgar, Christopher A Hunter, Costas Bekas, and Alpha A Lee. 2019.
“Molecular Transformer: A Model for Uncertainty-Calibrated Chemical Reaction Prediction.” ACS Central Science 5 (9): 1572–83.
https://doi.org/10.1021/acscentsci.9b00576.
Skinnider, Michael A. 2024.
“Invalid SMILES are beneficial rather than detrimental to chemical language models.” Nature Machine Intelligence 6 (4): 437–48.
https://doi.org/10.1038/s42256-024-00821-x.
Taber, Keith S. 2014.
“The Significance of Implicit Knowledge for Learning and Teaching Chemistry.” Chemistry Education Research and Practice 15 (4): 447–61.
https://doi.org/10.1039/c4rp00124a.
Taylor, Ross, Marcin Kardas, Guillem Cucurull, Thomas Scialom, Anthony Hartshorn, Elvis Saravia, Andrew Poulton, Viktor Kerkez, and Robert Stojnic. 2022.
“Galactica: A Large Language Model for Science.” arXiv.
https://doi.org/10.48550/arXiv.2211.09085.
Unke, Oliver T, Stefan Chmiela, Huziel E Sauceda, Michael Gastegger, Igor Poltavsky, Kristof T Schutt, Alexandre Tkatchenko, and Klaus-Robert Muller. 2021.
“Machine learning force fields.” Chemical Reviews 121 (16): 10142–86.
https://doi.org/10.1021/acs.chemrev.0c01111.
Wang, Hanchen, Tianfan Fu, Yuanqi Du, Wenhao Gao, Kexin Huang, Ziming Liu, Payal Chandak, et al. 2023.
“Scientific Discovery in the Age of Artificial Intelligence.” Nature 620 (7972): 47–60.
https://doi.org/10.1038/s41586-023-06221-2.
Wu, Jianchang, Luca Torresi, ManMan Hu, Patrick Reiser, Jiyun Zhang, Juan S. Rocha-Ortiz, Luyao Wang, et al. 2024.
“Inverse Design Workflow Discovers Hole-Transport Materials Tailored for Perovskite Solar Cells.” Science 386 (6727): 1256–64.
https://doi.org/10.1126/science.ads0901.
Wu, Juan-Ni, Tong Wang, Yue Chen, Li-Juan Tang, Hai-Long Wu, and Ru-Qin Yu. 2024.
“t-SMILES: a fragment-based molecular representation framework for de novo ligand design.” Nature Communications 15 (1): 4993.
https://doi.org/10.1038/s41467-024-49388-6.
Yang, Han, Chenxi Hu, Yichi Zhou, Xixian Liu, Yu Shi, Jielan Li, Guanzhi Li, et al. 2024.
“MatterSim: A Deep Learning Atomistic Model Across Elements, Temperatures and Pressures.” arXiv.
https://doi.org/10.48550/arXiv.2405.04967.
Yang, Wuyue, Liangrong Peng, Yi Zhu, and Liu Hong. 2020.
“When machine learning meets multiscale modeling in chemical reactions.” The Journal of Chemical Physics 153 (9).
https://doi.org/10.1063/5.0015779.
Yano, Junko, Kelly J Gaffney, John Gregoire, Linda Hung, Abbas Ourmazd, Joshua Schrier, James A Sethian, and Francesca M Toma. 2022.
“The case for data science in experimental chemistry: examples and recommendations.” Nature Reviews Chemistry 6 (5): 357–70.
https://doi.org/10.1038/s41570-022-00382-w.
Yao, Zhenpeng, Yanwei Lum, Andrew Johnston, Luis Martin Mejia-Mendoza, Xin Zhou, Yonggang Wen, Alán Aspuru-Guzik, Edward H. Sargent, and Zhi Wei Seh. 2022.
“Machine Learning for a Sustainable Energy Future.” Nature Reviews Materials 8 (3): 202–15.
https://doi.org/10.1038/s41578-022-00490-5.
Zhang, Di, Wei Liu, Qian Tan, Jingdan Chen, Hang Yan, Yuliang Yan, Jiatong Li, et al. 2024.
“ChemLLM: A Chemical Large Language Model.” arXiv.
https://doi.org/10.48550/arXiv.2402.06852.
Zhang, Wei, Qinggong Wang, Xiangtai Kong, Jiacheng Xiong, Shengkun Ni, Duanhua Cao, Buying Niu, et al. 2024.
“Fine-Tuning Large Language Models for Chemical Text Mining.” Chemical Science 15 (27): 10600–10611.
https://doi.org/10.1039/D4SC00924J.