2  The Shape and Structure of Chemical Data

2.1 Shape of Scientific Data

Data is the essential asset in data-driven techniques. To understand the successes and failures of machine learning (ML) models, it is instructive to explore how the structure of different datasets shapes the learning capabilities of different models. One useful lens for doing so is to consider how complex a system is (i.e., how many variables are needed to describe it) and what fraction of these variables is explicit. The set of variables required to describe a system can be viewed as spanning its state space. A state space encompasses all possible states of a system, similar to concepts in statistical mechanics (SM). In contrast to many other problems, however, we often cannot explicitly enumerate all variables and their potential values in relevant chemical systems. Commonly, many of the essential factors describing a system are implicit (“known unknowns” or “unknown unknowns”).

2.1.1 Irreducible complexity

Figure 2.1 illustrates how the state space of chemistry tends to become more implicit as we move from describing single atoms or small molecules in vacuo to real-world systems. For instance, we can explain almost all observed phenomena for a hydrogen atom from its position and atomic number via the Schrödinger equation. As we scale up to larger systems such as macromolecular structures or condensed phases, we have to deal with more “known unknowns” and “unknown unknowns”.[Martin (2022)] For example, it is currently impossible to model a full packed-bed reactor at the atomistic scale because the size of the problem scales with the number of parameters that can be tuned. Often, it becomes infeasible to explicitly label all variables and their values. We can describe such complexity as “irreducible”,[Pietsch and Wernecke (2017)] in contrast to “emergent” complexity, which arises in systems that can be described with simple equations, such as a double pendulum.

Figure 2.1: State space description for chemistry at different scales. We illustrate how the number of hidden variables (gray) grows with scale and complexity. For simple systems, we can explicitly write down all variables and their values and perfectly describe the system. For more complex systems—closer to practical applications—we can no longer do that, as many variables cannot be explicitly enumerated.

For phenomena characterized by irreducible complexity, success is often serendipitous. As pointed out by Rulev (2017), chemical literature commonly contains terms such as “to our surprise”, “remarkable reactivity”, or “unusual performance”, which may reflect the complexity of scientific questions and the diminishing explainability of observed results.

2.1.1.1 Emergent complexity

In contrast to irreducible complexity, there is a subset of chemical problems for which all relevant parameters can be explicitly listed, but the complexity emerges from the intricate, potentially chaotic, interactions among them. A well-known example is the Belousov-Zhabotinsky reaction,[Cassani, Monteverde, and Piumetti (2021)] which exhibits oscillations and pattern formation as a result of a complex chemical reaction network. The individual chemical reactions within the network are simple, but their interactions create a dynamic, self-organizing system with properties not seen in the individual components. An example of how fast the parameter space can grow was provided by Koziarski et al. (2024), who showed that a single reaction type and a few hundred molecular building blocks can already create tens of thousands of possible solutions. When scaling up to just five reaction types, the space grows to approximately \(10^{22}\) solutions, and exploring it exhaustively becomes intractable. When optimization objectives are involved—finding the shortest synthesis pathway or maximizing the yield—such problems are often NP-hard, meaning that no known polynomial-time algorithm can guarantee optimal solutions, though various heuristic and approximation methods can provide good ones.
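
To make the scaling argument concrete, the following back-of-the-envelope sketch shows how a single bimolecular reaction type over a few hundred building blocks already yields tens of thousands of candidates, and how additional reaction types and steps blow the space up exponentially. All numbers are illustrative and do not reproduce the exact enumeration of Koziarski et al. (2024).

```python
from math import comb

# Back-of-the-envelope illustration of combinatorial growth in a virtual
# reaction space. All numbers are illustrative only.
n_blocks = 300                     # assumed pool of molecular building blocks

# One bimolecular reaction type applied to all pairs of building blocks:
single_type = comb(n_blocks, 2)
print(f"one reaction type: ~{single_type:,} products")   # 44,850, i.e. tens of thousands

# Allowing several reaction types and multi-step sequences (each step combines
# an intermediate with a fresh building block) grows the space exponentially
# with the number of steps:
n_types, n_steps = 5, 4
multi_step = single_type * (n_types * n_blocks) ** (n_steps - 1)
print(f"{n_types} types, {n_steps} steps: ~{multi_step:.2e} pathways")
```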

Knowing the ratio between explicit and implicit parameters helps in selecting an appropriate model architecture. If most of the variance is caused by explicit factors, these can be incorporated as priors or constraints in the model, thereby increasing data efficiency. This strategy can, for instance, be applied in the development of force fields, where we know the governing equations and their symmetries and can enforce these symmetries in the model architecture (as hard restrictions on the family of solutions).[Unke et al. (2021); Musil et al. (2021)] However, when the variance is dominated by implicit factors, such constraints can no longer be formulated, as the governing relationships are not known. In those cases, flexible general-purpose models (GPMs) with soft inductive biases—which guide the model toward preferred solutions without enforcing strict constraints on the solution space[Wilson (2025)]—are more suitable. Large language models (LLMs) fall into this category.
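
As a minimal illustration of a hard inductive bias, the toy sketch below hard-wires permutation invariance of atoms into a property model by summing identical per-atom contributions; the descriptor and weights are placeholders, not a real force field.

```python
import numpy as np

# Minimal sketch of a "hard" inductive bias: permutation invariance of atoms
# is enforced by construction, by summing identical per-atom terms.
def atom_features(z, xyz):
    """Toy per-atom descriptor: atomic number and distance from the centroid."""
    centroid = xyz.mean(axis=0)
    return np.stack([z, np.linalg.norm(xyz - centroid, axis=1)], axis=1)

def energy(z, xyz, w):
    """Permutation-invariant by design: a sum over per-atom contributions."""
    per_atom = atom_features(z, xyz) @ w   # shape (n_atoms,)
    return per_atom.sum()

z = np.array([8.0, 1.0, 1.0])              # water: O, H, H
xyz = np.array([[0.0, 0.0, 0.0],
                [0.96, 0.0, 0.0],
                [-0.24, 0.93, 0.0]])
w = np.array([0.1, -0.05])                 # toy weights

perm = [2, 0, 1]                           # reorder the atoms
assert np.isclose(energy(z, xyz, w), energy(z[perm], xyz[perm], w))
```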

2.2 Scale of Chemical Data

Chemistry is an empirical science in which every prediction bears the burden of proof through experimental validation.[Zunger (2019)] However, there is often a mismatch between the realities of a chemistry lab and the datasets on which ML models for chemistry are trained. Much of current data-driven modeling in chemistry focuses on a few large, structured, and highly curated datasets in which most of the variance is explicit (reducible complexity). Such datasets, QM9 for example,[Ramakrishnan et al. (2014)] often come from quantum-chemical computations. Experimental chemistry, however, tends to show significantly higher variance and a greater degree of irreducible complexity. In addition, since data generation is often expensive, datasets are small; and because science is about doing new things for the first time, many datasets also contain at least some unique variables.

Considering the largest chemistry text dataset, ChemPile,[Mirza et al. (2025)] we can look at how its largest subsets compare to the smallest ones (see Table 2.1). The largest subset is approximately three million times larger than the smallest one (a quick check of this ratio is sketched after the table).

Table 2.1: Token counts for the three largest and three smallest datasets in the ChemPile[Mirza et al. (2025)] collection. The dominating datasets contribute a large portion of the total token count (a token is the smallest unit of text that an ML model can process), while the small datasets significantly increase the diversity.
Dataset: Token count
Three largest ChemPile datasets
NOMAD crystal structures[Scheidgen et al. (2023)]: 5,808,052,794
Open Reaction Database (ORD)[Kearnes et al. (2021)] reaction prediction: 5,347,195,320
RDKit molecular features: 5,000,435,822
Three smallest ChemPile datasets
Hydrogen storage materials[HyMARC (2019)]: 1,935
List of amino acids[Alberts (2002)]: 6,000
ORD[Kearnes et al. (2021)] recipe yield prediction: 8,372
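
As a quick sanity check on the roughly three-million-fold spread quoted above, the token counts from Table 2.1 give:

```python
# Quick check of the spread between the largest and smallest ChemPile subsets
# listed in Table 2.1 (token counts copied from the table).
largest = 5_808_052_794   # NOMAD crystal structures
smallest = 1_935          # hydrogen storage materials
print(f"ratio ~ {largest / smallest:,.0f}")  # ~3,001,578, i.e. about three million
```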

The prevalence of many small, specialized datasets over large ones is commonly referred to as “the long tail problem”.[Heidorn (2008)]

Figure 2.2: Cumulative token count based on the ChemPile tabular datasets.[Mirza et al. (2025)] We compare the approximate token counts of three corpora: the Llama-3 training dataset,[Grattafiori et al. (2024)] the openly available chemistry papers in the ChemPile-Paper dataset, and the ChemPile-LIFT dataset. By aggregating the collection of tabular datasets converted to text format in the ChemPile-LIFT subset, we reach the same order of magnitude as the collection of open chemistry papers. However, without the smaller datasets, we cannot capture the breadth and complexity of chemical data, which is essential for training GPMs. The tokenization methods for ChemPile and Llama-3 are described in the respective papers.

This can be seen in Figure 2.2: while a few datasets are large, the majority of the corpus consists of small but collectively significant and chemically diverse datasets. The actual tail of chemical data is even longer, as Figure 2.2 only shows the distribution for manually curated tabular datasets, not all data created in the chemical sciences. Because every dataset in the long tail relies on unique sources of variance, it is very difficult to leverage this long tail with conventional ML techniques. The promise of GPMs such as LLMs, however, is that they can flexibly integrate and jointly model the diverse small datasets that exist in the chemical sciences.

2.3 Dataset Creation

Before a model can be trained or tested, suitable data must first be collected. It is important to note that when working with GPMs, data can be ingested in raw form but usually should not be; some form of pre-processing is almost always required. Training GPMs typically requires a large and diverse dataset, compiled in a form that can be efficiently ingested by the training pipeline.

Strategies for doing so can be broadly categorized into two groups (see Figure 2.3). One can utilize a “top-down” approach where a large and diverse pool of data—e.g., results from web-crawled resources such as CommonCrawl[“Common Crawl” (2024)]—is filtered using custom-built procedures (e.g., using regular expressions or classification models). This approach is gaining traction in the development of foundation models such as LLMs.[Penedo et al. (2023); Penedo et al. (2024); Guo et al. (2025)] Alongside large filtered datasets, various data augmentation techniques have further increased the performance of GPMs.[Maini et al. (2024); Pieler et al. (2024)]
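
A minimal sketch of what such a custom-built filter might look like is shown below; the keyword and formula patterns are illustrative examples only, not those used by any specific pipeline.

```python
import re

# Illustrative "top-down" pre-filter: keep documents that match simple
# chemistry-related patterns before passing them to a learned classifier.
CHEMISTRY_PATTERNS = [
    r"\bmol(?:e(?:cule|cular))?\b",
    r"\bsynthes(?:is|ize[ds]?)\b",
    r"\b(?:catalyst|catalytic)\b",
    r"\b[A-Z][a-z]?\d*(?:[A-Z][a-z]?\d*)+\b",   # crude chemical-formula pattern, e.g. H2O, C6H12O6
]
_compiled = [re.compile(p, re.IGNORECASE) for p in CHEMISTRY_PATTERNS[:-1]]
_compiled.append(re.compile(CHEMISTRY_PATTERNS[-1]))  # formula pattern stays case-sensitive

def keep_document(text: str, min_hits: int = 2) -> bool:
    """Keep a document if it matches at least `min_hits` distinct patterns."""
    return sum(bool(p.search(text)) for p in _compiled) >= min_hits

docs = [
    "The catalyst enables the synthesis of C6H12O6 derivatives.",
    "Top ten travel destinations for the summer.",
]
print([keep_document(d) for d in docs])  # [True, False]
```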

Alternatively, one can take a “bottom-up” approach by specifically creating novel datasets for a given problem—an approach which has been very popular in ML for chemistry.

In practice, a combination of both approaches is often used. In most cases, key techniques include filtering and the generation of synthetic data.

Figure 2.3: Dataset creation protocols. In “top-down” approaches, we curate a large corpus of data, which can be used to train GPMs. The “bottom-up” approach starts from a problem definition, and the dataset can be collected via literature mining and experiments. Both approaches can make use of synthetic data to increase the data size and diversity.

2.3.1 Filtering

GPMs are currently trained on very large datasets, enabled by the availability of ever-growing computational resources.[Krizhevsky, Sutskever, and Hinton (2012); Kaplan et al. (2020); Hooker (2020); Dotan and Milli (2019)] The Internet has been the primary source for constructing GPM training datasets. While the initial focus was on training on maximally large datasets, empirical evidence has shown that smaller, higher-quality datasets can lead to better results.[Gunasekar et al. (2023); Marion et al. (2023)] For example, Shao et al. (2024) filtered CommonCrawl for mathematical text using a combination of regular expressions and a custom, iteratively trained classification model.[Bojanowski et al. (2017)] An alternative approach was pursued by Thrush, Potts, and Hashimoto (2024), who introduced a training-free framework. In this method, the pre-training text was chosen by measuring the correlation of each web domain’s perplexity (a metric that measures how well a language model predicts a sequence of text)—as scored by \(90\) publicly available LLMs—with downstream benchmark accuracy, keeping only the domains with the strongest correlation.
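
The toy sketch below captures the idea behind such perplexity-correlation selection: per-domain perplexities of several existing models are correlated with those models' benchmark scores, and domains where lower perplexity tends to go with higher accuracy are kept. All numbers and domain names are made up for illustration.

```python
import numpy as np

# Rows: publicly available models; columns: web domains. Values are invented.
domains = ["domainA.org", "domainB.com", "domainC.net"]
perplexity = np.array([     # perplexity of each model on each domain
    [12.0, 55.0, 30.0],
    [10.0, 60.0, 28.0],
    [15.0, 55.0, 35.0],
    [ 9.0, 70.0, 26.0],
])
benchmark_acc = np.array([0.62, 0.65, 0.55, 0.70])  # each model's benchmark score

def pearson(x, y):
    x = (x - x.mean()) / x.std()
    y = (y - y.mean()) / y.std()
    return float((x * y).mean())

# A domain is kept when lower perplexity on it tends to go with higher
# downstream accuracy, i.e. the perplexity-accuracy correlation is strongly negative.
scores = {d: pearson(perplexity[:, i], benchmark_acc) for i, d in enumerate(domains)}
keep = [d for d, r in sorted(scores.items(), key=lambda kv: kv[1]) if r < -0.5]
print(scores, keep)
```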

In the chemical domain, ChemPile[Mirza et al. (2025)] is the only open-source, pre-training-scale dataset that underwent several filtering steps. For example, a large subset of the papers in ChemPile-Paper comes from the Europe PMC dataset. To filter for chemistry papers, a custom classification model was trained from scratch using topic-labeled data from the CAMEL[Li et al. (2023)] dataset. The accuracy of the model (\(\text{F1-score}=0.77\)) was evaluated on expert-annotated data.

2.3.2 Synthetic Data

Instead of only relying on existing datasets, one can also leverage techniques for generating synthetic data. Generation of synthetic data is often required to augment scarce real-world data, but can also be used to achieve the desired model behavior (e.g., invariance in image-based models).

These approaches can be grouped into rule-based and generative methods. Rule-based methods apply manually defined transformations—such as rotations and mirroring—to present different representations of the same instance to a model. In contrast, generative augmentation creates new data by applying transformations learned by an ML model.

2.3.2.1 Rule-based augmentation

The transformations applied for generating new data in rule-based approaches vary depending on the modality (e.g., image, text, or audio). The most common application of rule-based techniques is to images, via transformations such as distortion, rotation, blurring, or cropping.[Shorten and Khoshgoftaar (2019)] In chemistry, tools like RanDepict[Brinkhaus et al. (2022)] have been used to create enriched datasets of chemical representations. These tools generate human-like drawings of chemical structures that mimic the illustrations commonly found in the scientific literature or in patents (e.g., by applying image templates from different publishers or emulating the style of older manuscripts) and further augment them using conventional image-augmentation techniques.
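
A minimal sketch of such conventional image augmentations, here using torchvision as an assumed imaging library; the parameter values are illustrative and unrelated to RanDepict's internals.

```python
from torchvision import transforms

# Conventional rule-based image augmentations of the kind applied on top of
# rendered structure depictions. Parameter values are illustrative.
augment = transforms.Compose([
    transforms.RandomRotation(degrees=5),                       # slight rotation
    transforms.GaussianBlur(kernel_size=3),                     # mild blurring
    transforms.RandomResizedCrop(size=224, scale=(0.9, 1.0)),   # crop and resize
    transforms.ColorJitter(brightness=0.2, contrast=0.2),       # lighting variation
])

# `augment(image)` returns a randomly transformed copy of a PIL image, so the
# same depiction can be shown to the model in many variants.
```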

Rule-based augmentations can also be applied to text. Early approaches involved simple operations like random word swapping, random synonym replacement, and random deletions or insertions, often referred to as “easy data augmentation” methods.[Shorten, Khoshgoftaar, and Furht (2021); Wei and Zou (2019)]

In chemistry, text templates have been used.[Xie et al. (2023); Mirza et al. (2025); Jablonka et al. (2024); Van Herck et al. (2025)] Such templates define a sentence structure with semantically configurable fields, which are then filled using structured tabular data. However, it is still unclear how best to construct such templates, as studies have shown that the same data presented in different templates can lead to distinct generalization behavior.[Gonzales et al. (2024)]
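
A minimal sketch of the template idea, with hypothetical field names and example values; the wording is not taken from the cited works.

```python
# Illustrative text templates filled from structured tabular data.
# Field names, template wording, and values are examples only.
records = [
    {"name": "caffeine", "smiles": "CN1C=NC2=C1C(=O)N(C)C(=O)N2C", "logp": -0.07},
    {"name": "aspirin",  "smiles": "CC(=O)OC1=CC=CC=C1C(=O)O",     "logp": 1.19},
]

TEMPLATES = [
    "The molecule {name} with SMILES {smiles} has a logP of {logp}.",
    "Question: What is the logP of {name} ({smiles})? Answer: {logp}.",
]

for record in records:
    for template in TEMPLATES:
        print(template.format(**record))
```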

We can also apply rule-based augmentation to specific molecular representations (for more details on representations see Section 3.2.1). For example, the same molecule can be represented by multiple different, yet valid, simplified molecular input line entry system (SMILES) strings. Bjerrum (2017) used this technique to augment a predictive model by mapping multiple SMILES strings to a single property. Averaging predictions over multiple SMILES strings yielded at least a \(10\%\) improvement compared to using a single SMILES string. Such techniques can be applied to other molecular representations (e.g., International Union of Pure and Applied Chemistry (IUPAC) names or self-referencing embedded strings (SELFIES)), but historically, SMILES has been used more often, and its augmentations have consequently been studied more extensively.[Kimber, Gagnebin, and Volkamer (2021); Born et al. (2023); Arús-Pous et al. (2019)]
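
A minimal sketch of SMILES enumeration with RDKit's randomized output, paired with test-time averaging over the variants; the `predict` function is a placeholder for any trained property model.

```python
from rdkit import Chem
import numpy as np

# Enumerate several valid SMILES strings for the same molecule using RDKit's
# randomized atom ordering; average a (placeholder) prediction over them.
def enumerate_smiles(smiles: str, n: int = 10):
    mol = Chem.MolFromSmiles(smiles)
    return sorted({Chem.MolToSmiles(mol, doRandom=True) for _ in range(n)})

def predict(smiles: str) -> float:
    """Placeholder property model; in practice a trained predictor."""
    return float(len(smiles))  # stand-in only

variants = enumerate_smiles("CC(=O)OC1=CC=CC=C1C(=O)O")  # aspirin
prediction = np.mean([predict(s) for s in variants])      # test-time averaging
print(len(variants), prediction)
```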

A broad array of augmentation techniques has been applied to spectral data—from simple noise addition[Ke et al. (2018); Moreno-Barea et al. (2022)] to physics-informed augmentations (e.g., through DFT simulations).[Oviedo et al. (2019); Gao et al. (2020)]
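
A minimal numpy sketch of simple noise-based spectral augmentation (additive noise, a baseline offset, and a slight intensity scaling); the parameter values are illustrative.

```python
import numpy as np

# Simple noise-based augmentation of a 1D spectrum. Parameter values are illustrative.
def augment_spectrum(intensities, rng, noise_sd=0.01, baseline_sd=0.02, scale_sd=0.05):
    noise = rng.normal(0.0, noise_sd, size=intensities.shape)   # point-wise noise
    baseline = rng.normal(0.0, baseline_sd)                     # constant baseline shift
    scale = 1.0 + rng.normal(0.0, scale_sd)                     # global intensity scaling
    return scale * intensities + baseline + noise

rng = np.random.default_rng(42)
grid = np.linspace(0, 10, 500)
spectrum = np.exp(-0.5 * ((grid - 5.0) / 0.2) ** 2)             # toy single-peak spectrum
augmented = [augment_spectrum(spectrum, rng) for _ in range(5)]  # five noisy variants
```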

2.3.2.2 Generative augmentation

In some cases, however, it is not possible to write down the rules. For instance, it is not obvious how text can be transformed into different styles using rules alone. Recent advances in deep learning have enabled another, more flexible, approach to synthetic data generation.[Maini et al. (2024)] A simple technique is contextual augmentation,[Kobayashi (2018)] in which replacement words are sampled from the probability distribution of a language model (LM). Another technique is “back translation”,[Edunov et al. (2018)] a process in which text is translated into another language and then back into the original language to generate semantically similar variants. While the start and end language are typically the same,[Lu et al. (2024)] the technique can also be extended to multilingual setups.[Hong et al. (2024)]
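
A sketch of the back-translation loop, with a hypothetical `translate` function standing in for whichever translation model or service a real pipeline would use.

```python
# Sketch of back translation. `translate` is a hypothetical placeholder;
# a real pipeline would call a translation model or service here.
def translate(text: str, source: str, target: str) -> str:
    raise NotImplementedError("plug in a translation model or API")

def back_translate(text: str, pivot: str = "de") -> str:
    """Translate to a pivot language and back to obtain a paraphrase."""
    intermediate = translate(text, source="en", target=pivot)
    return translate(intermediate, source=pivot, target="en")
```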

Other recent approaches have harnessed auto-formalization,[Wu et al. (2022)] an LLM-powered approach that can turn natural-language mathematical proofs into computer-verifiable mathematical languages such as Lean[De Moura et al. (2015)] or Isabelle[Wenzel, Paulson, and Nipkow (2008)]. Such datasets have been used to advance the mathematical capabilities of LMs.[Xin et al. (2024); Trinh et al. (2024)]

A drawback of generatively augmented data is that its validity is cumbersome to assess at scale unless it can be verified automatically by a computer program. In addition, it has been demonstrated that an increasing ratio of synthetic data can lead to model collapse.[Kazdan et al. (2024); Shumailov et al. (2024)]

Having reviewed data sources and generation methods, we next discuss architectures and training approaches for GPMs.