2  The Shape and Structure of Chemical Data

2.1 Shape of Scientific Data

Data is the essential asset in data-driven techniques. To understand the successes and failures of machine learning (ML) models, it is instructive to explore how the structure of different datasets shapes the learning capabilities of different models. One useful lens for doing so is to consider how complex a system is (i.e., how many variables are needed to describe it) and what fraction of these variables is explicit. The set of variables required to describe a system can be viewed as spanning its state space. A state space encompasses all possible states of a system, similar to concepts in statistical mechanics (SM). In contrast to many other problems, however, we often cannot explicitly enumerate all variables and their potential values in relevant chemical systems. Commonly, many of the essential factors describing a system are implicit (“known unknowns” or “unknown unknowns”).

2.1.1 Irreducible complexity

Figure 2.1 illustrates how the state space of chemistry tends to become more implicit as we move from describing single atoms or small molecules in vacuo to real-world systems. For instance, we can explain almost all observed phenomena for a hydrogen atom from its position and atomic number via the Schrödinger equation. As we scale up to larger systems such as macromolecular structures or condensed phases, we have to deal with more “known unknowns” and “unknown unknowns”.[Martin (2022)] For example, it is currently impossible to model a full packed-bed reactor at the atomistic scale because the size of the problem scales with the number of parameters that can be tuned. Often, it becomes infeasible to explicitly label all variables and their values. We can describe such complexity as “irreducible”,[Pietsch and Wernecke (2017)] in contrast to “emergent” complexity, which arises in systems that can be described with simple equations, such as a double pendulum.

Figure 2.1: State space description for chemistry at different scales. We illustrate how the number of hidden variables (gray) grows with scale and complexity. For simple systems, we can explicitly write down all variables and their values and perfectly describe the system. For more complex systems—closer to practical applications—we can no longer do that, as many variables cannot be explicitly enumerated.

For phenomena characterized by irreducible complexity, success is often serendipitous. As pointed out by Rulev (2017), chemical literature commonly contains terms such as “to our surprise”, “remarkable reactivity”, or “unusual performance”, which may reflect the complexity of scientific questions and the diminishing explainability of observed results.

2.1.1.1 Emergent complexity

In contrast to irreducible complexity, there is a subset of chemical problems for which all relevant parameters can be explicitly listed, but the complexity emerges from the intricate, potentially chaotic, interactions among them. A well-known example is the Belousov-Zhabotinsky reaction,[Cassani, Monteverde, and Piumetti (2021)] which exhibits oscillations and pattern formation as a result of a complex chemical reaction network. The individual chemical reactions within the network are simple, but their interactions create a dynamic, self-organizing system with properties not seen in the individual components. An example of how fast the parameter space can grow was provided by Koziarski et al. (2024), who showed that a single reaction type and a few hundred molecular building blocks can already create tens of thousands of possible solutions. When scaling up to just five reaction types, the space grows to approximately \(10^{22}\) solutions, and exploring it exhaustively becomes intractable. When optimization objectives are involved—finding the shortest synthesis pathway or maximizing the yield—such problems are often NP-hard, meaning that no known polynomial-time algorithm can guarantee optimal solutions, though various heuristic and approximation methods can provide good ones.
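
To make the scaling argument concrete, the following back-of-the-envelope sketch shows how a single bimolecular reaction type over a few hundred building blocks already yields tens of thousands of candidates, and how additional reaction types and steps blow the space up exponentially. All numbers are illustrative and do not reproduce the exact enumeration of Koziarski et al. (2024).

```python
from math import comb

# Back-of-the-envelope illustration of combinatorial growth in a virtual
# reaction space. All numbers are illustrative only.
n_blocks = 300                     # assumed pool of molecular building blocks

# One bimolecular reaction type applied to all pairs of building blocks:
single_type = comb(n_blocks, 2)
print(f"one reaction type: ~{single_type:,} products")   # 44,850, i.e. tens of thousands

# Allowing several reaction types and multi-step sequences (each step combines
# an intermediate with a fresh building block) grows the space exponentially
# with the number of steps:
n_types, n_steps = 5, 4
multi_step = single_type * (n_types * n_blocks) ** (n_steps - 1)
print(f"{n_types} types, {n_steps} steps: ~{multi_step:.2e} pathways")
```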

Knowing the ratio between explicit and implicit parameters helps in selecting an appropriate model architecture. If most of the variance is caused by explicit factors, these can be incorporated as priors or constraints in the model, thereby increasing data efficiency. This strategy can, for instance, be applied in the development of force fields, where we know the governing equations and their symmetries and can enforce these symmetries in the model architecture (as hard restrictions on the family of solutions).[Unke et al. (2021); Musil et al. (2021)] However, when the variance is dominated by implicit factors, such constraints can no longer be formulated, as the governing relationships are not known. In those cases, flexible general-purpose models (GPMs) with soft inductive biases—which guide the model toward preferred solutions without enforcing strict constraints on the solution space[Wilson (2025)]—are more suitable. Large language models (LLMs) fall into this category.
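
As a minimal illustration of a hard inductive bias, the toy sketch below hard-wires permutation invariance of atoms into a property model by summing identical per-atom contributions; the descriptor and weights are placeholders, not a real force field.

```python
import numpy as np

# Minimal sketch of a "hard" inductive bias: permutation invariance of atoms
# is enforced by construction, by summing identical per-atom terms.
def atom_features(z, xyz):
    """Toy per-atom descriptor: atomic number and distance from the centroid."""
    centroid = xyz.mean(axis=0)
    return np.stack([z, np.linalg.norm(xyz - centroid, axis=1)], axis=1)

def energy(z, xyz, w):
    """Permutation-invariant by design: a sum over per-atom contributions."""
    per_atom = atom_features(z, xyz) @ w   # shape (n_atoms,)
    return per_atom.sum()

z = np.array([8.0, 1.0, 1.0])              # water: O, H, H
xyz = np.array([[0.0, 0.0, 0.0],
                [0.96, 0.0, 0.0],
                [-0.24, 0.93, 0.0]])
w = np.array([0.1, -0.05])                 # toy weights

perm = [2, 0, 1]                           # reorder the atoms
assert np.isclose(energy(z, xyz, w), energy(z[perm], xyz[perm], w))
```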

2.2 Scale of Chemical Data

Chemistry is an empirical science in which every prediction bears the burden of proof through experimental validation.[Zunger (2019)] However, there is often a mismatch between the realities of a chemistry lab and the datasets on which ML models for chemistry are trained. Much of current data-driven modeling in chemistry focuses on a few large, structured, and highly curated datasets in which most of the variance is explicit (reducible complexity). Such datasets, QM9 for example,[Ramakrishnan et al. (2014)] often come from quantum-chemical computations. Experimental chemistry, however, tends to show significantly higher variance and a greater degree of irreducible complexity. In addition, since data generation is often expensive, datasets are small; and because science is about doing new things for the first time, many datasets also contain at least some unique variables.

Considering the largest chemistry text dataset, ChemPile,[Mirza et al. (2025)] we can look at how its largest subsets compare to the smallest ones (see Table 2.1). The largest subset is approximately three million times larger than the smallest one (a quick check of this ratio is sketched after the table).

Table 2.1: Token counts for the three largest and three smallest datasets in the ChemPile[Mirza et al. (2025)] collection. The dominating datasets contribute a large portion of the total token count (a token is the smallest unit of text that an ML model can process), while the small datasets significantly increase the diversity.
Dataset: Token count
Three largest ChemPile datasets
NOMAD crystal structures[Scheidgen et al. (2023)]: 5,808,052,794
Open Reaction Database (ORD)[Kearnes et al. (2021)] reaction prediction: 5,347,195,320
RDKit molecular features: 5,000,435,822
Three smallest ChemPile datasets
Hydrogen storage materials[HyMARC (2019)]: 1,935
List of amino acids[Alberts (2002)]: 6,000
ORD[Kearnes et al. (2021)] recipe yield prediction: 8,372
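
As a quick sanity check on the roughly three-million-fold spread quoted above, the token counts from Table 2.1 give:

```python
# Quick check of the spread between the largest and smallest ChemPile subsets
# listed in Table 2.1 (token counts copied from the table).
largest = 5_808_052_794   # NOMAD crystal structures
smallest = 1_935          # hydrogen storage materials
print(f"ratio ~ {largest / smallest:,.0f}")  # ~3,001,578, i.e. about three million
```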

The prevalence of many small, specialized datasets over large ones is commonly referred to as “the long tail problem”.[Heidorn (2008)]

Figure 2.2: Cumulative token count based on the ChemPile tabular datasets.[Mirza et al. (2025)] We compare the approximate token counts of three corpora: the Llama-3 training dataset,[Grattafiori et al. (2024)] the openly available chemistry papers in the ChemPile-Paper dataset, and the ChemPile-LIFT dataset. By aggregating the collection of tabular datasets converted to text format in the ChemPile-LIFT subset, we reach the same order of magnitude as the collection of open chemistry papers. However, without the smaller datasets, we cannot capture the breadth and complexity of chemical data, which is essential for training GPMs. The tokenization methods for ChemPile and Llama-3 are described in the respective papers.

This can be seen in Figure 2.2: while a few datasets are large, the majority of the corpus consists of small but collectively significant and chemically diverse datasets. The actual tail of chemical data is even longer, as Figure 2.2 only shows the distribution for manually curated tabular datasets, not all data created in the chemical sciences. Because every dataset in the long tail relies on unique sources of variance, it is very difficult to leverage this long tail with conventional ML techniques. The promise of GPMs such as LLMs, however, is that they can flexibly integrate and jointly model the diverse small datasets that exist in the chemical sciences.

2.3 Dataset Creation

Before a model can be trained or tested, suitable data must first be collected. It is important to note that when working with GPMs, data can be ingested in raw form but usually should not be; some form of pre-processing is almost always required. Training GPMs typically requires a large and diverse dataset, compiled in a form that can be efficiently ingested by the training pipeline.

Strategies for doing so can be broadly categorized into two groups (see Figure 2.3). One can utilize a “top-down” approach where a large and diverse pool of data—e.g., results from web-crawled resources such as CommonCrawl[“Common Crawl” (2024)]—is filtered using custom-built procedures (e.g., using regular expressions or classification models). This approach is gaining traction in the development of foundation models such as LLMs.[Penedo et al. (2023); Penedo et al. (2024); Guo et al. (2025)] Alongside large filtered datasets, various data augmentation techniques have further increased the performance of GPMs.[Maini et al. (2024); Pieler et al. (2024)]
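
A minimal sketch of what such a custom-built filter might look like is shown below; the keyword and formula patterns are illustrative examples only, not those used by any specific pipeline.

```python
import re

# Illustrative "top-down" pre-filter: keep documents that match simple
# chemistry-related patterns before passing them to a learned classifier.
CHEMISTRY_PATTERNS = [
    r"\bmol(?:e(?:cule|cular))?\b",
    r"\bsynthes(?:is|ize[ds]?)\b",
    r"\b(?:catalyst|catalytic)\b",
    r"\b[A-Z][a-z]?\d*(?:[A-Z][a-z]?\d*)+\b",   # crude chemical-formula pattern, e.g. H2O, C6H12O6
]
_compiled = [re.compile(p, re.IGNORECASE) for p in CHEMISTRY_PATTERNS[:-1]]
_compiled.append(re.compile(CHEMISTRY_PATTERNS[-1]))  # formula pattern stays case-sensitive

def keep_document(text: str, min_hits: int = 2) -> bool:
    """Keep a document if it matches at least `min_hits` distinct patterns."""
    return sum(bool(p.search(text)) for p in _compiled) >= min_hits

docs = [
    "The catalyst enables the synthesis of C6H12O6 derivatives.",
    "Top ten travel destinations for the summer.",
]
print([keep_document(d) for d in docs])  # [True, False]
```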

Alternatively, one can take a “bottom-up” approach by specifically creating novel datasets for a given problem—an approach which has been very popular in ML for chemistry.

In practice, a combination of both approaches is often used. In most cases, key techniques include filtering and the generation of synthetic data.

Figure 2.3: Dataset creation protocols. In “top-down” approaches, we curate a large corpus of data, which can be used to train GPMs. The “bottom-up” approach starts from a problem definition, and the dataset can be collected via literature mining and experiments. Both approaches can make use of synthetic data to increase the data size and diversity.

2.3.1 Filtering

GPMs are currently trained on very large datasets, enabled by the availability of ever-growing computational resources.[Krizhevsky, Sutskever, and Hinton (2012); Kaplan et al. (2020); Hooker (2020); Dotan and Milli (2019)] The Internet has been the primary source for constructing GPM training datasets. While the initial focus was on training on maximally large datasets, empirical evidence has shown that smaller, higher-quality datasets can lead to better results.[Gunasekar et al. (2023); Marion et al. (2023)] For example, Shao et al. (2024) filtered CommonCrawl for mathematical text using a combination of regular expressions and a custom, iteratively trained classification model.[Bojanowski et al. (2017)] An alternative approach was pursued by Thrush, Potts, and Hashimoto (2024), who introduced a training-free framework. In this method, the pre-training text was chosen by measuring the correlation of each web domain’s perplexity (a metric that measures how well a language model predicts a sequence of text)—as scored by \(90\) publicly available LLMs—with downstream benchmark accuracy, keeping only the domains with the strongest correlation.
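
The toy sketch below captures the idea behind such perplexity-correlation selection: per-domain perplexities of several existing models are correlated with those models' benchmark scores, and domains where lower perplexity tends to go with higher accuracy are kept. All numbers and domain names are made up for illustration.

```python
import numpy as np

# Rows: publicly available models; columns: web domains. Values are invented.
domains = ["domainA.org", "domainB.com", "domainC.net"]
perplexity = np.array([     # perplexity of each model on each domain
    [12.0, 55.0, 30.0],
    [10.0, 60.0, 28.0],
    [15.0, 55.0, 35.0],
    [ 9.0, 70.0, 26.0],
])
benchmark_acc = np.array([0.62, 0.65, 0.55, 0.70])  # each model's benchmark score

def pearson(x, y):
    x = (x - x.mean()) / x.std()
    y = (y - y.mean()) / y.std()
    return float((x * y).mean())

# A domain is kept when lower perplexity on it tends to go with higher
# downstream accuracy, i.e. the perplexity-accuracy correlation is strongly negative.
scores = {d: pearson(perplexity[:, i], benchmark_acc) for i, d in enumerate(domains)}
keep = [d for d, r in sorted(scores.items(), key=lambda kv: kv[1]) if r < -0.5]
print(scores, keep)
```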

In the chemical domain, ChemPile[Mirza et al. (2025)] is the only open-source, pre-training-scale dataset that underwent several filtering steps. For example, a large subset of the papers in ChemPile-Paper comes from the Europe PMC dataset. To filter for chemistry papers, a custom classification model was trained from scratch using topic-labeled data from the CAMEL[Li et al. (2023)] dataset. The accuracy of the model (\(\text{F1-score}=0.77\)) was evaluated on expert-annotated data.

2.3.2 Synthetic Data

Instead of only relying on existing datasets, one can also leverage techniques for generating synthetic data. Generation of synthetic data is often required to augment scarce real-world data, but can also be used to achieve the desired model behavior (e.g., invariance in image-based models).

These approaches can be grouped into rule-based and generative methods. Rule-based methods apply manually defined transformations—such as rotations and mirroring—to present different representations of the same instance to a model. In contrast, generative augmentation creates new data by applying transformations learned by an ML model.

2.3.2.1 Rule-based augmentation

The transformations applied for generating new data in rule-based approaches vary depending on the modality (e.g., image, text, or audio). The most common application of rule-based techniques is to images, via transformations such as distortion, rotation, blurring, or cropping.[Shorten and Khoshgoftaar (2019)] In chemistry, tools like RanDepict[Brinkhaus et al. (2022)] have been used to create enriched datasets of chemical representations. These tools generate human-like drawings of chemical structures that mimic the illustrations commonly found in the scientific literature or in patents (e.g., by applying image templates from different publishers or emulating the style of older manuscripts) and further augment them using conventional image-augmentation techniques.
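
A minimal sketch of such conventional image augmentations, here using torchvision as an assumed imaging library; the parameter values are illustrative and unrelated to RanDepict's internals.

```python
from torchvision import transforms

# Conventional rule-based image augmentations of the kind applied on top of
# rendered structure depictions. Parameter values are illustrative.
augment = transforms.Compose([
    transforms.RandomRotation(degrees=5),                       # slight rotation
    transforms.GaussianBlur(kernel_size=3),                     # mild blurring
    transforms.RandomResizedCrop(size=224, scale=(0.9, 1.0)),   # crop and resize
    transforms.ColorJitter(brightness=0.2, contrast=0.2),       # lighting variation
])

# `augment(image)` returns a randomly transformed copy of a PIL image, so the
# same depiction can be shown to the model in many variants.
```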

Rule-based augmentations can also be applied to text. Early approaches involved simple operations like random word swapping, random synonym replacement, and random deletions or insertions, often referred to as “easy data augmentation” methods.[Shorten, Khoshgoftaar, and Furht (2021); Wei and Zou (2019)]

In chemistry, text templates have been used.[Xie et al. (2023); Mirza et al. (2025); Jablonka et al. (2024); Van Herck et al. (2025)] Such templates define a sentence structure with semantically configurable fields, which are then filled using structured tabular data. However, it is still unclear how best to construct such templates, as studies have shown that the same data presented in different templates can lead to distinct generalization behavior.[Gonzales et al. (2024)]
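
A minimal sketch of the template idea, with hypothetical field names and example values; the wording is not taken from the cited works.

```python
# Illustrative text templates filled from structured tabular data.
# Field names, template wording, and values are examples only.
records = [
    {"name": "caffeine", "smiles": "CN1C=NC2=C1C(=O)N(C)C(=O)N2C", "logp": -0.07},
    {"name": "aspirin",  "smiles": "CC(=O)OC1=CC=CC=C1C(=O)O",     "logp": 1.19},
]

TEMPLATES = [
    "The molecule {name} with SMILES {smiles} has a logP of {logp}.",
    "Question: What is the logP of {name} ({smiles})? Answer: {logp}.",
]

for record in records:
    for template in TEMPLATES:
        print(template.format(**record))
```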

We can also apply rule-based augmentation to specific molecular representations (for more details on representations see Section 3.2.1). For example, the same molecule can be represented by multiple different, yet valid, simplified molecular input line entry system (SMILES) strings. Bjerrum (2017) used this technique to augment a predictive model by mapping multiple SMILES strings to a single property. Averaging predictions over multiple SMILES strings yielded at least a \(10\%\) improvement compared to using a single SMILES string. Such techniques can be applied to other molecular representations (e.g., International Union of Pure and Applied Chemistry (IUPAC) names or self-referencing embedded strings (SELFIES)), but historically, SMILES has been used more often, and its augmentations have consequently been studied more extensively.[Kimber, Gagnebin, and Volkamer (2021); Born et al. (2023); Arús-Pous et al. (2019)]
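
A minimal sketch of SMILES enumeration with RDKit's randomized output, paired with test-time averaging over the variants; the `predict` function is a placeholder for any trained property model.

```python
from rdkit import Chem
import numpy as np

# Enumerate several valid SMILES strings for the same molecule using RDKit's
# randomized atom ordering; average a (placeholder) prediction over them.
def enumerate_smiles(smiles: str, n: int = 10):
    mol = Chem.MolFromSmiles(smiles)
    return sorted({Chem.MolToSmiles(mol, doRandom=True) for _ in range(n)})

def predict(smiles: str) -> float:
    """Placeholder property model; in practice a trained predictor."""
    return float(len(smiles))  # stand-in only

variants = enumerate_smiles("CC(=O)OC1=CC=CC=C1C(=O)O")  # aspirin
prediction = np.mean([predict(s) for s in variants])      # test-time averaging
print(len(variants), prediction)
```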

A broad array of augmentation techniques has been applied to spectral data—from simple noise addition[Ke et al. (2018); Moreno-Barea et al. (2022)] to physics-informed augmentations (e.g., through DFT simulations).[Oviedo et al. (2019); Gao et al. (2020)]
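
A minimal numpy sketch of simple noise-based spectral augmentation (additive noise, a baseline offset, and a slight intensity scaling); the parameter values are illustrative.

```python
import numpy as np

# Simple noise-based augmentation of a 1D spectrum. Parameter values are illustrative.
def augment_spectrum(intensities, rng, noise_sd=0.01, baseline_sd=0.02, scale_sd=0.05):
    noise = rng.normal(0.0, noise_sd, size=intensities.shape)   # point-wise noise
    baseline = rng.normal(0.0, baseline_sd)                     # constant baseline shift
    scale = 1.0 + rng.normal(0.0, scale_sd)                     # global intensity scaling
    return scale * intensities + baseline + noise

rng = np.random.default_rng(42)
grid = np.linspace(0, 10, 500)
spectrum = np.exp(-0.5 * ((grid - 5.0) / 0.2) ** 2)             # toy single-peak spectrum
augmented = [augment_spectrum(spectrum, rng) for _ in range(5)]  # five noisy variants
```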

2.3.2.2 Generative augmentation

In some cases, however, it is not possible to write down the rules. For instance, it is not obvious how text can be transformed into different styles using rules alone. Recent advances in deep learning have enabled another, more flexible, approach to synthetic data generation.[Maini et al. (2024)] A simple technique is contextual augmentation,[Kobayashi (2018)] in which replacement words are sampled from the probability distribution of a language model (LM). Another technique is “back translation”,[Edunov et al. (2018)] a process in which text is translated into another language and then back into the original language to generate semantically similar variants. While the start and end language are typically the same,[Lu et al. (2024)] the technique can also be extended to multilingual setups.[Hong et al. (2024)]
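
A sketch of the back-translation loop, with a hypothetical `translate` function standing in for whichever translation model or service a real pipeline would use.

```python
# Sketch of back translation. `translate` is a hypothetical placeholder;
# a real pipeline would call a translation model or service here.
def translate(text: str, source: str, target: str) -> str:
    raise NotImplementedError("plug in a translation model or API")

def back_translate(text: str, pivot: str = "de") -> str:
    """Translate to a pivot language and back to obtain a paraphrase."""
    intermediate = translate(text, source="en", target=pivot)
    return translate(intermediate, source=pivot, target="en")
```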

Other recent approaches have harnessed auto-formalization,[Wu et al. (2022)] an LLM-powered approach that can turn natural-language mathematical proofs into computer-verifiable mathematical languages such as Lean[De Moura et al. (2015)] or Isabelle[Wenzel, Paulson, and Nipkow (2008)]. Such datasets have been used to advance the mathematical capabilities of LMs.[Xin et al. (2024); Trinh et al. (2024)]

A drawback of generatively augmented data is that its validity is cumbersome to assess at scale unless it can be verified automatically by a computer program. In addition, it has been demonstrated that an increasing ratio of synthetic data can lead to model collapse.[Kazdan et al. (2024); Shumailov et al. (2024)]

Having reviewed data sources and generation methods, we next discuss architectures and training approaches for GPMs.