6 Accelerating Applications

The application of accelerated approaches in the scientific discovery cycle (see Figure 5.1) hinges on their ability to streamline and enhance each stage of the process. However, a fundamental challenge in effectively implementing these approaches lies in the choice of machine-readable representation.

This challenge is particularly evident in the representation of molecules and materials, which must balance computational efficiency with the preservation of structural, compositional, and functional properties. Take, for example, the high-temperature superconductor \(\text{YBa}_2\text{Cu}_3\text{O}_{7-x}\). While atomic species and positions are theoretically sufficient to solve the Schrödinger equation and describe this material, such a representation may not provide the adaptability necessary for diverse tasks. What defines a good representation depends on the problem [Huang and Lilienfeld (2016)]. A representation designed to predict critical temperature must efficiently encode the relationship between oxygen stoichiometry and superconducting properties, emphasizing features like oxygen vacancy patterns and charge transfer mechanisms. Conversely, a representation for structural stability might prioritize different geometric or bonding characteristics.

This tension has led to three primary strategies for representing molecules and materials (read Section 3.2.1 to learn in detail about the different representations that currently exist). First, domain-specific text-based formats—such as simplified molecular input line entry system (SMILES) [Weininger (1988)], self-referencing embedded strings (SELFIES) [Krenn et al. (2020)], and crystallographic information file (CIF) [Hall, Allen, and Brown (1991)]—offer compact, machine-readable encodings of structural information. While these necessarily omit certain physical details, their computational tractability has enabled breakthroughs, as demonstrated by Jablonka et al. (2024) in their large language model (LLM)-based generation of valid molecular and material structures.

Yet, the question remains: Which representation is optimal for a given task? Future advances in accelerated discovery will likely hinge on adaptive representations that dynamically balance these competing demands.

6.1 Property Prediction

General-purpose models (GPMs) have emerged as a powerful tool for predicting molecular and material properties, offering an alternative to traditional quantum mechanical calculations or specialized machine learning (ML) models. Current GPM-driven property prediction tasks span both classification and regression. Unlike conventional approaches that rely on task-specific architectures and extensively labeled data, GPMs have demonstrated strong generalization capabilities across diverse domains, efficiently adapting to various prediction tasks. Their success extends to multiple datasets, from standardized benchmarks such as MoleculeNet [Wu et al. (2018)], to curated datasets targeting specific applications such as antibacterial activity [Chithrananda, Grand, and Ramsundar (2020)] or photovoltaic efficiency [Aneesh et al. (2025)].

Three key methodologies have been explored to adapt LLMs for property prediction: prompting techniques (see Section 3.11.1), fine-tuning (see Section 3.11.1.1) on domain-specific data, and retrieval-augmented generation (RAG) (see Section 3.12.1.3) approaches that combine LLMs with external knowledge bases.

Table 6.1: Non-comprehensive list of GPMs applied to property-prediction tasks. The table presents different models and their applications across different molecular and materials property prediction benchmarks, showing the diversity of properties (from molecular toxicology to crystal band gaps), datasets used for evaluation, modeling approaches (prompting, fine-tuning, or retrieval-augmented generation), and task types (classification or regression).

Model	Property	Dataset	Approach	Task
GPT-Chem (Jablonka et al. 2024)	HOMO/LUMO	QMUGs (Isert et al. 2022)	FT	C, R
	Solubility	DLS-100 (Mitchell 2017)	FT	C, R
	Lipophilicity	LipoData (Jablonka et al. 2024)	FT	C, R
	Hydration Free Energy	FreeSolv (Mobley and Guthrie 2014)	FT	C, R
	Photoconversion Efficiency	OPV (Jablonka et al. 2024)	FT	C, R
	Toxicology	Tox21 (Richard et al. 2021)	FT	C, R
	\(\text{CO}_{2}\) Henry coeff. of MOFs	MOFSorb-H (L.-C. Lin et al. 2012)	FT	C, R
LLM-Prop (Rubungo et al. 2023)	Band gap	CrystalFeatures-MP2022 (Rubungo et al. 2023)	FT	R
	Volume	CrystalFeatures-MP2022 (Rubungo et al. 2023)	FT	R
	Is the band gap direct?	CrystalFeatures-MP2022 (Rubungo et al. 2023)	FT	C
LLM4SD (Zheng et al. 2025)	Blood-brain-barrier penetration	BBBP (Sakiyama, Fukuda, and Okuno 2021)	P	C
	FDA approval	ClinTox (Wu et al. 2018)	P	C
	Toxicology	Tox21 (Richard et al. 2021)	P	C
	Drug-related side effects	SIDER (Kuhn et al. 2016)	P	C
	HIV replication inhibition	HIV (Wu et al. 2018)	P	C
	β-secretase binding	BACE (Wu et al. 2018)	P	C
	Solubility	ESOL (Wu et al. 2018)	P	R
	Hydration Free Energy	FreeSolv (Mobley and Guthrie 2014)	P	R
	Lipophilicity	Lipophilicity (Wu et al. 2018)	P	R
	Quantum mechanics	QM9 (Wu et al. 2018)	P	R
LLaMP (Chiang et al. 2024)	Bulk modulus	Materials Project (Riebesell et al. 2025)	RAG	R
	Formation energy	Materials Project (Riebesell et al. 2025)	RAG	R
	Electronic band gap	Materials Project (Riebesell et al. 2025)	RAG	R
	Multi-element band gap	Materials Project (Riebesell et al. 2025)	RAG	R

Key: P = prompting; FT = fine-tuned model; RAG = retrieval-augmented generation; C = classification; R = regression

6.1.1 Prompting

Prompt engineering involves designing targeted instructions to guide GPMs in performing specialized tasks without altering their underlying parameters by leveraging their embedded knowledge. In molecular and materials science, this strategy goes beyond simply asking a model to predict properties. It also includes carefully structured prompts to elicit detailed molecular and material descriptions directly from the model’s pre-trained knowledge.

H. Liu et al. (2025) conducted a comprehensive evaluation of different prompting techniques to predict the properties of organic small molecules and crystal materials. These techniques included domain-knowledge prompting (prior knowledge is embedded in the prompt), expert prompting (role-play instructions), and few-shot chain-of-thought (CoT) prompting (the text “Let’s think step by step” is added). Of these, domain-knowledge prompting performed best. However, their evaluation was limited to a relatively small set of molecules and tasks, and the effectiveness of their domain-knowledge approach may not generalize to other molecular property domains.

Building on these foundational prompting strategies, few-shot prompting approaches leverage in-context learning (ICL) to enhance performance through selected examples. Y. Liu et al. (2024) used SMILES string representations of molecules with few-shot ICL, retrieving structurally similar molecules as demonstrations to enhance property prediction. This approach highlights how ICL can transfer knowledge from similar molecule examples without requiring model fine-tuning for each task. However, the effectiveness of ICL depends on the quality of the retrieved examples.
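The retrieval step can be illustrated with a minimal sketch. It assumes character bigrams as a crude stand-in for a molecular fingerprint (real pipelines would typically use a cheminformatics fingerprint), and the function names, prompt template, and labels are hypothetical illustrations rather than the setup used by Y. Liu et al. (2024).

```python
def bigrams(smiles):
    """Crude fingerprint stand-in (assumption): the set of character bigrams."""
    return {smiles[i:i + 2] for i in range(len(smiles) - 1)}


def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two feature sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def build_few_shot_prompt(query, pool, k=2):
    """Select the k pool molecules most similar to the query as demonstrations."""
    query_fp = bigrams(query)
    ranked = sorted(pool, key=lambda ex: tanimoto(query_fp, bigrams(ex[0])), reverse=True)
    blocks = [f"SMILES: {smiles}\nLabel: {label}" for smiles, label in ranked[:k]]
    blocks.append(f"SMILES: {query}\nLabel:")  # the query is left for the model to complete
    return "\n\n".join(blocks)


# Illustrative, made-up labels; not taken from any benchmark.
pool = [
    ("CCO", "soluble"),
    ("CCCCCCCCCC", "insoluble"),
    ("CCCO", "soluble"),
    ("c1ccccc1", "insoluble"),
]
prompt = build_few_shot_prompt("CCCCO", pool, k=2)
print(prompt)
```

The quality caveat from the text shows up directly here: if the similarity measure ranks unrelated molecules highly, the demonstrations carry little useful signal for the query.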

Fifty, Leskovec, and Thrun (2023) moved beyond direct text prompting of molecules and introduced context-aware molecule prediction (CAMP): an ICL algorithm that uses a two-stage encoding approach without relying on pre-trained LLMs. First, a specialized message-passing neural network (MPNN) encodes molecule graphs into molecular embeddings rather than processing them as raw text. These embeddings are then fed into a transformer encoder, which learns contextualized representations across the support set (a small collection of labeled molecule-property pairs) and the unlabeled query molecules. They demonstrated CAMP’s ability to outperform existing few-shot learning baselines by providing relevant molecular examples within the prompt context. However, this approach is constrained by the context-length limitations of the underlying language models (LMs) and the challenge of selecting optimal demonstration examples.

More sophisticated approaches have leveraged prompting as part of multi-modal frameworks. The LLM4SD pipeline by Zheng et al. (2025) employs specialized prompts to guide LMs through their pre-trained knowledge of the scientific literature, generating known rules (e.g., molecules weighing under 500 Da are more likely to pass the blood-brain barrier). These rules transform molecules into feature vectors for use with a random forest model, which the authors consider “interpretable”. For example, CCO could translate to the vector \([2,46.07,1,1]\), where the entries are [# C, MW, # H-bond donors, # H-bond acceptors]. This approach outperformed specialized state-of-the-art (SOTA) models across \(58\) benchmark tasks, while providing interpretable reasoning about prediction logic (see Table 6.1 for properties predicted by this model). However, its reliance on rule extraction may limit its ability to capture complex, non-linear relationships that specialized deep learning models can identify.
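As a rough illustration of how extracted rules might become model inputs, the sketch below applies a list of rule functions to precomputed descriptors. The descriptor values for ethanol are those of the example vector in the text. Only the 500 Da rule comes from the text; the other rules, the dictionary keys, and the `featurize` helper are hypothetical. The downstream random forest is not shown.

```python
# Descriptors for ethanol (CCO), matching the example vector [2, 46.07, 1, 1] in the text.
ethanol = {"n_carbon": 2, "mol_weight": 46.07, "h_donors": 1, "h_acceptors": 1}

# Each rule maps a descriptor dictionary to one numeric feature.
# Only the 500 Da rule is taken from the text; the rest are hypothetical placeholders.
rules = [
    ("mw_below_500", lambda d: float(d["mol_weight"] < 500.0)),
    ("few_h_donors", lambda d: float(d["h_donors"] <= 3)),
    ("n_carbon", lambda d: float(d["n_carbon"])),
]


def featurize(descriptors, rules):
    """Apply every rule in order and collect the results as a feature vector."""
    return [rule_fn(descriptors) for _, rule_fn in rules]


features = featurize(ethanol, rules)
print(dict(zip([name for name, _ in rules], features)))
```

Because each feature is tied to a named rule, feature importances from a tree-based model can be read back as statements about those rules, which is one plausible reading of the interpretability claim.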

6.1.1.1 LLMs as Feature Extractors

Another emerging application of LLMs is their use as “feature extractors”, where they generate textual or embedded representations of molecules or materials. For instance, in materials science, Aneesh et al. (2025) employed LLMs to generate text embeddings of perovskite solar cell compositions. These embeddings were subsequently used to train a graph neural network (GNN) for predicting power conversion efficiency, demonstrating the potential of LLMs to enhance feature representation in materials informatics. Similarly, in the molecular domain, Srinivas and Runkana (2024b) used zero-shot LLM prompting to generate detailed textual descriptions of molecular functional groups, which were then used to train a small LM. This LM computes text-level embeddings of molecules. In parallel, graph-level embeddings are generated from molecular graphs derived from SMILES strings. Finally, the graph-level and text-level embeddings are integrated to produce a semantically enriched embedding.

In a different implementation of fine-tuning, Balaji et al. (2023) used ChatGPT to generate text descriptions of molecules that were then used to train a RoBERTa (125M) model for property prediction, showing how LM-generated representations can access latent spaces that SMILES strings alone might not capture. Similarly, Z. Li et al. (2024) introduced the MoleX framework, which fine-tunes ChemBERTa-2 [Ahmad et al. (2022)] on Group SELFIES [Cheng et al. (2023)] (a functional group-based molecular representation) to then extract a single LLM-derived embedding of molecules that captures the chemical semantics at the functional group level. This allowed them to determine which functional groups or fragments contribute to molecular properties, which in turn can be converted into reliable explanations of said properties.

6.1.2 Fine-Tuning

Figure 6.1: **Fine-tuned `GPT-3` for predicting solid-solution formation in high-entropy alloys.** Performance comparison of different ML approaches as a function of the number of training points. Results are shown for `Automatminer` (blue), `CrabNet` transformer (orange), and fine-tuned `GPT-3` (red), with error bars showing the standard error of the mean. The non-Google test set shows the fine-tuned `GPT-3` model tested on compounds without an exact Google search match (dark red). The dashed line shows performance using random forest. `GPT-3` achieves comparable accuracy to traditional approaches with significantly fewer training examples. Data adapted from Jablonka et al. (2024).

6.1.2.1 Language-Interfaced Fine-Tuning (LIFT)

Dinh et al. (2022) showed that reformulating regression and classification as question-answering (Q&A) tasks enables the use of an unmodified model architecture while improving performance (see Section 3.11.1.1 for a deeper discussion of LIFT). Recognizing the persistent scarcity of experimental data, Jablonka et al. (2024) designed a LIFT-based framework using GPT-3 fine-tuned on small task-specific datasets (see Table 6.1). Their seminal work demonstrated that fine-tuned GPT-3 can match or surpass specialized ML models in various chemistry tasks. A key finding was fine-tuned GPT-3’s ability to generalize beyond its training data. When tested on compounds absent from Google Search (and therefore likely absent from its training data), it still performed well, indicating that it was not simply recalling memorized information (see Figure 6.1).
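The reformulation of a tabular data point as a Q&A pair can be sketched as follows. The question template, the rounding to two decimals, the equal-width binning, and the `to_lift_example` helper are assumptions made for illustration; they are not the exact format used by Dinh et al. (2022) or Jablonka et al. (2024), and the numeric values are placeholders.

```python
def to_lift_example(smiles, prop, value, n_bins=None, lo=None, hi=None):
    """Format one data point as a prompt/completion pair for fine-tuning.

    With n_bins set, the target is discretized into equal-width bins on [lo, hi]
    (classification); otherwise it is written as a rounded number (regression).
    """
    prompt = f"What is the {prop} of {smiles}?"
    if n_bins is None:
        completion = f"{value:.2f}"
    else:
        width = (hi - lo) / n_bins
        bin_index = int((value - lo) / width)
        bin_index = max(0, min(bin_index, n_bins - 1))  # clamp values on or outside the edges
        completion = str(bin_index)
    return {"prompt": prompt, "completion": completion}


# Placeholder values, chosen only to show both task types.
regression = to_lift_example("CCO", "lipophilicity", -0.314)
classification = to_lift_example("CCO", "lipophilicity", 0.8, n_bins=3, lo=-2.0, hi=4.0)
print(regression)
print(classification)
```

Both task types end up as plain text pairs, which is why the same unmodified language model can be fine-tuned on either.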

In a follow-up to the work of Jablonka et al. (2024), Van Herck et al. (2025) systematically evaluated this approach across 22 diverse real-world chemistry case studies using three open-source models. They demonstrated that fine-tuned LLMs can effectively predict various material properties. For example, they achieved \(96\%\) accuracy in predicting the adhesive free energy of polymers, outperforming traditional ML methods like random forest (\(90\%\) accuracy). When predicting properties of monomers using SMILES notation, the fine-tuned models reached an average accuracy of \(84\%\) across four different properties. Particularly notable was the ability of LLMs to work with non-standard inputs: in a protein phase separation study, raw protein sequences could be input directly without pre-processing and still yielded \(95\%\) prediction accuracy. At the same time, when training datasets were very small (15 data points, e.g., in the metal-organic framework (MOF) synthesis case study), the predictive accuracy of all fine-tuned models was lower than the random baseline. These case studies provide preliminary evidence that these models can achieve useful predictive performance with some small datasets, work with various chemical representations (SMILES, MOFid, and International Union of Pure and Applied Chemistry (IUPAC) names), and outperform traditional ML approaches for some material property prediction tasks.

In the materials domain, LLM-Prop fine-tunes T5 [Raffel et al. (2020)] to predict crystalline material properties from text descriptions generated by Robocrystallographer [Ganose and Jain (2019)]. By discarding T5’s decoder and adding task-specific prediction heads, the approach reduces computational overhead while leveraging the model’s ability to process structured crystal descriptions. The method demonstrates that natural language representations can effectively capture key material features, offering an alternative to traditional graph-based models like GNNs.

Fine-tuning has also been used to adapt selective state space models (SSMs) like Mamba (see Section 3.8). After pre-training on 91 million molecules, the Mamba-based model \(\text{O}_{\text{SMI}}\text{-SSM-336M}\) outperformed transformer methods (Yield-BERT [Krzyzanowski, Pickett, and Pogány (2025)]) in reaction yield prediction (e.g., Buchwald-Hartwig cross-coupling) and achieved competitive results in molecular property prediction benchmarks [Soares et al. (2025)].

6.1.2.2 Foundational GNNs and Machine-Learning Interatomic Potentials (MLIPs)

The fine-tuning approach has also been applied to “foundational GNNs” [Sypetkowski et al. (2024); Shoghi et al. (2023)] and MLIPs, approaches distinct from GPMs. For example, Shoghi et al. (2023) and Sypetkowski et al. (2024) report SOTA performance on property prediction tasks. “Foundational” MLIPs pre-trained on large datasets encompassing many chemical elements can be fine-tuned for specific downstream tasks [Batatia et al. (2022)], such as calculating sublimation enthalpies of molecular crystal polymorphs [Kaur et al. (2025)].

6.1.2.3 Limitations

One central challenge is finding balance in datasets. In practical applications, researchers often have many more examples of poor-performing materials than optimal ones, resulting in unbalanced datasets that can diminish model performance. Van Herck et al. (2025) point out that in the catalyzed cleavage reaction study, only \(3.8\%\) of catalysts were labeled as “good”, forcing researchers to reduce their training set significantly to maintain balance. They also note that LLMs struggle with highly complex or noisy datasets, as seen in their study of catalytic isomerization, where even after hyperparameter optimization, the models failed to achieve meaningful predictive power due to the high noise in the experimental data and limited sample size. Finally, they note that although LLMs can work with different chemical representations, the choice of representation significantly impacts performance. For example, when predicting polymerization rates, models using SMILES notation significantly outperformed those using IUPAC names, indicating that representation selection remains an important consideration.
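The cost of balancing by undersampling can be made concrete with a small sketch. The toy dataset below mirrors only the \(3.8\%\) "good" fraction quoted above; the dataset size, the `undersample` helper, and the random-drop strategy are assumptions for illustration, not the procedure used by Van Herck et al. (2025).

```python
import random


def undersample(examples, label_of, seed=0):
    """Randomly drop examples until every class has as many as the rarest class."""
    by_class = {}
    for example in examples:
        by_class.setdefault(label_of(example), []).append(example)
    n_min = min(len(items) for items in by_class.values())
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    balanced = []
    for items in by_class.values():
        balanced.extend(rng.sample(items, n_min))
    rng.shuffle(balanced)
    return balanced


# Toy data: 38 of 1000 entries (3.8 %) are labeled "good".
data = [(f"catalyst_{i}", "good" if i < 38 else "bad") for i in range(1000)]
balanced = undersample(data, label_of=lambda example: example[1])
print(len(data), "->", len(balanced))
```

In this toy case, balancing keeps only 76 of the 1000 examples, which illustrates why a heavily skewed label distribution forces a much smaller training set.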

Fine-tuning effectively adapts LLMs to specialized chemistry tasks, but its dependence on static datasets hinders adaptability to new or evolving knowledge. RAG, whose fundamentals are described in detail in Section 3.12.1.3, overcomes these limitations by dynamically integrating external data sources, enabling more flexible and up-to-date reasoning.

6.1.3 Agents

Caldas Ramos et al. introduced MAPI-LLM, a framework that processes natural-language queries about material properties, using an LLM to decide which of the available tools (such as the Materials Project application programming interface (API), the Reaction-Network package, or Google Search) to call when generating a response [Jablonka et al. (2023)]. MAPI-LLM employs a reasoning and acting (ReAct) prompt (see Section 3.12.1 to read more about ReAct) to convert prompts such as “Is \(\text{Fe}_2\text{O}_3\) magnetic?” or “What is the band gap of \(\text{Mg}(\text{Fe}_3\text{O}_3)_2\)?” into queries for the Materials Project API. The system handles multi-step prompts through logical reasoning. For example, when asked “If \(\text{Mn}_2\text{FeO}_3\) is not metallic, what is its band gap?”, the system creates a two-step workflow that first verifies metallicity and then retrieves the band gap.
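The two-step conditional lookup can be sketched with a mock tool. Everything here is hypothetical: the in-memory `MOCK_DB` stands in for a materials database, its formulas and values are placeholders rather than real data, and the control flow is hard-coded, whereas in MAPI-LLM an LLM decides which tool to call at each step.

```python
# Mock records standing in for a materials database; placeholder values, not real data.
MOCK_DB = {
    "A2BO3": {"is_metal": False, "band_gap_eV": 1.5},
    "AB": {"is_metal": True, "band_gap_eV": 0.0},
}


def lookup(formula, field):
    """Hypothetical tool: return one field of a record, as a database API call would."""
    return MOCK_DB[formula][field]


def conditional_band_gap(formula):
    """Answer 'If <formula> is not metallic, what is its band gap?' in two tool calls."""
    trace = []
    is_metal = lookup(formula, "is_metal")  # step 1: verify the premise
    trace.append(f"Action: lookup({formula}, is_metal) -> Observation: {is_metal}")
    if is_metal:
        return None, trace  # premise fails, so the second call is skipped
    gap = lookup(formula, "band_gap_eV")  # step 2: retrieve the requested property
    trace.append(f"Action: lookup({formula}, band_gap_eV) -> Observation: {gap}")
    return gap, trace


answer, steps = conditional_band_gap("A2BO3")
print(answer)
print("\n".join(steps))
```

The trace mimics the alternating action and observation pattern of a ReAct prompt, with the second action depending on the first observation.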

Building on this foundation of agent-based materials querying, Chiang et al. (2024) advanced the approach with LLaMP, a framework that employs “hierarchical” ReAct agents to interact with computational and experimental data. This “hierarchical” framework employs a supervisor-assistant agent architecture in which a complex problem is broken down and tasks are delegated to domain-specific agents. LLaMP addresses the challenge of hallucinations more effectively than standard LLM approaches by grounding responses in data retrieved from materials databases (e.g., crystal structures, elastic tensors), which also counteracts systematic LLM biases in property predictions. These biases include the tendency of LLMs to overestimate certain properties, such as bulk moduli, and to make band gap prediction errors that follow compositional patterns learned during training rather than physical principles.

6.1.4 Core Limitations

Figure 6.2: **Normalized error distributions for materials property prediction models across different architectures.** Each point represents the normalized error of a model on a specific property prediction task. Normalization was achieved with min/max values of each dataset to produce a range of errors between 0 and 1. The first column (blue) shows GNN-based models, the second column (red) displays LLM approaches, and the third column (orange) represents other baseline methods and SOTA models including `CrabNet` (A. Y.-T. Wang et al. 2021). Lower values indicate better predictive performance. Data adapted from Alampara, Miret, and Jablonka (2024).

Alampara, Miret, and Jablonka (2024) introduced MatText, a framework for evaluating LMs’ ability to predict properties of materials using text-based representations. Their findings indicate that current LLMs (including pre-trained BERT and fine-tuned LLaMA-3-8B) are effective for tasks relying purely on compositional information (e.g., element types and local bonding patterns), but struggle to leverage geometric or positional information encoded in text, as reflected in Figure 6.2. This observation suggests that transformer-based architectures may be fundamentally limited to applications where spatial understanding is not required. Their experiments with data scaling and text representations reveal that increasing pre-training data or adding geometric details fails to improve downstream property prediction, challenging the conventional assumption that larger models and datasets universally enhance performance [Frey et al. (2023)]. Notably, Frey et al. (2023) demonstrated power-law scaling in chemical LLMs, but MatText’s results imply that such scaling may not overcome architectural biases against geometric reasoning in materials tasks [Gruver et al. (2024)].
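The min/max normalization mentioned in the caption of Figure 6.2 puts errors from tasks with different units on a common 0 to 1 scale. A minimal sketch follows, assuming the minimum and maximum are taken over the errors of all models on one dataset; the exact grouping used for the figure is not stated here, and the error values are placeholders.

```python
def min_max_normalize(errors):
    """Rescale a list of errors linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(errors), max(errors)
    if hi == lo:
        return [0.0 for _ in errors]  # all models tie; avoid dividing by zero
    return [(error - lo) / (hi - lo) for error in errors]


# Placeholder errors of three models on one dataset (arbitrary units).
errors_on_dataset = [1.0, 2.0, 3.0]
normalized = min_max_normalize(errors_on_dataset)
print(normalized)
```

After this rescaling, a value near 0 marks the best model on that dataset and a value near 1 the worst, regardless of the property's original units.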

6.2 Molecular and Material Generation

Figure 6.3: **Pipeline for molecular and materials generation.** The workflow begins with input structures represented in various formats, which are used to train ML models to generate novel molecular and material structures. The generated structures should undergo a feedback loop through validation processes before being applied in the real world. Blue boxes indicate well-established areas of the pipeline with mature methodologies, while the red box represents critical bottlenecks.

Early work in molecular and materials generation relied heavily on unconditional generation, where models produce novel structures without explicit guidance, relying solely on patterns learned from training data. One example is latent space sampling in autoencoders, where random vectors are decoded into new structures [Yoshikai et al. (2024)]. These methods excel at exploring chemical space broadly but lack fine-grained control. This limitation underscores the need for conditional generation, which uses explicit prompts or constraints (e.g., property targets, structural fragments) to steer GPMs toward meaningful molecule or material designs. Beyond the generation step, as Figure 6.3 shows, critical bottlenecks persist in synthesizability and physical consistency at the validation stage.

6.2.1 Generation

6.2.1.1 Prompting

While zero-shot and few-shot prompting strategies demonstrate promising flexibility for molecule generation, benchmark studies [Guo et al. (2023)] reveal significant limitations that restrict their practical utility. Guo et al. (2023) exposed fundamental gaps in LLMs’ molecular design capabilities through a systematic evaluation. GPT-4 was reported to produce chemically valid SMILES \(89\%\) of the time but achieved less than \(20\%\) accuracy in matching the target specifications. This result is far below that of specialized models like MolT5 [Edwards et al. (2022)]. They conclude that this performance gap stems from LLMs’ inadequate understanding of SMILES syntax and structure-property relationships. Subsequent work by Bhattacharya et al. (2024) explored whether systematic prompt engineering could overcome these limitations, demonstrating that such prompts could guide Claude 3 Opus to generate chemically valid molecules (\(97\%\) syntactic validity) with controlled modifications, including fine-grained structural changes (median Tanimoto similarity \(0.67\)–\(0.69\)) and predictable electronic property shifts (0.14 eV–0.27 eV highest occupied molecular orbital (HOMO) energy changes). Hybrid approaches like FrontierX extend this method with knowledge-augmented prompting, where LLMs generate both molecule predictions and explanations that are used to fine-tune smaller LMs, with all resulting embeddings ultimately combined via hierarchical attention mechanisms to produce the final SMILES representation [Srinivas and Runkana (2024a)]. FrontierX showed improved accuracy over pure prompting strategies but sacrificed the generalizability that makes LLMs attractive, as the model requires re-training for each new molecular domain.

6.2.1.2 Fine-Tuning

To overcome the limitations of prompting, fine-tuning has been adopted in molecular and materials generation, much like its use in property prediction with LIFT-based frameworks (see Section 3.11.1.1 for a deeper explanation of LIFT and Section 6.1.2 for a discussion of LIFT applied to property prediction tasks). B. Yu et al. (2024) demonstrated that systematic fine-tuning across various chemical tasks, including molecule generation from captions, can improve performance while remaining parameter-efficient, training only \(0.58\%\) of the model parameters via low-rank adaptation (LoRA).
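The small trainable fraction follows from how LoRA is built: each adapted weight matrix of shape \(d \times k\) receives two low-rank factors of shapes \(d \times r\) and \(r \times k\), which adds \(r(d + k)\) trainable parameters while the original matrix stays frozen. The sketch below works through this arithmetic for a hypothetical configuration; the model size, rank, and number of adapted matrices are assumptions and are not the settings used by B. Yu et al. (2024).

```python
def lora_trainable_fraction(d_model, n_layers, rank, adapted_per_layer, total_params):
    """Fraction of parameters trained when LoRA adapts square d_model x d_model matrices."""
    per_matrix = rank * (d_model + d_model)  # factor B is d x r, factor A is r x d
    trainable = per_matrix * adapted_per_layer * n_layers
    return trainable / total_params


# Hypothetical configuration: 7e9 base parameters, 32 layers, hidden size 4096,
# rank 8, and two adapted matrices per layer.
fraction = lora_trainable_fraction(
    d_model=4096, n_layers=32, rank=8, adapted_per_layer=2, total_params=7_000_000_000
)
print(f"{fraction:.4%}")
```

Raising the rank or adapting more matrices per layer increases the fraction linearly, so fractions of roughly half a percent are reachable with modest ranks.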

The molecule-caption translation task (Mol2Cap), which involves generating textual descriptions from molecular representations and vice versa (Cap2Mol), has become a standard benchmark for evaluating GPMs for molecule generation [Edwards et al. (2022)]. Under the “Mol2Cap”/“Cap2Mol” task paradigm, in-context molecule adaptation (ICMA) avoids domain-specific pre-training by combining retrieval-augmented in-context learning with fine-tuning on ICL examples [J. Li et al. (2025)]. On the ChEBI-20 [Edwards, Zhai, and Ji (2021)] and PubChem324k [Z. Liu et al. (2023)] datasets, ICMA nearly doubles baseline performance, with ICMA powered by Mistral-7B achieving a 0.581 bilingual evaluation understudy (BLEU) score in Mol2Cap and \(46.0\%\) exact match in Cap2Mol [J. Li et al. (2025)]. However, its reliance on retrieved examples raises concerns about generalization to novel scaffolds. Similarly, MolReFlect enhances fine-grained alignment through a teacher-student framework, where a larger LLM (e.g., GPT-4) extracts substructure-aware captions to guide a smaller model (Mistral-7B), improving Cap2Mol accuracy while reducing hallucinations [J. Li et al. (2024)]. Meanwhile, PEIT-LLM extends the task to property-conditioned generation, using instructions (SMILES-text-property tuples) to optimize for captioning and prediction jointly [X. Lin et al. (2025)].

Fine-tuned LMs have shown promise in molecule and materials generation. However, their reliance on decoding and SMILES/SELFIES representations introduces fundamental limitations: degeneracy (multiple valid SMILES for the same molecule) and difficulty capturing complex structural relationships implicit in textual descriptions.

6.2.1.3 Diffusion and Flow Matching

Diffusion and flow-based models operate directly on latent representations, enabling more flexible generation of diverse and novel structures.[Zhu, Xiao, and Honavar (2024)] Moreover, emerging hybrid architectures combine the strengths of LLMs with diffusion and flow matching models to overcome the limitations of each paradigm individually [Sriram et al. (2024)].

Beyond text-based representations, Llamole introduced a multimodal LLM approach capable of text and graph generation by integrating a base LLM with graph diffusion transformers and graph neural networks for multi-conditional molecular generation and retrosynthetic planning. Specifically, the authors used trigger tokens (<design> and <retro>) and a query token (<query>) to switch between these components, and improved the synthesis success rate from \(5\%\) to \(35\%\) [G. Liu et al. (2024)].

A unique challenge with crystalline materials is generating a material that possesses both discrete (atom type) and continuous (atomic position and lattice geometry) variables. Sriram et al. (2024) developed FlowLLM to address this challenge. They recognized that the respective strengths of LLMs, modeling discrete values and conditional prompting, and denoising models, modeling continuous values and equivariances, could be combined to create a hybrid architecture. A fine-tuned LLM is used to learn an effective base distribution of metastable crystals via text-based representations, which is then iteratively refined through Riemannian flow-matching (RFM) to optimize atomic coordinates and lattice parameters.[Sriram et al. (2024)]

6.2.1.4 Reinforcement Learning and Preference Optimization

Translating GPM-generated outputs to the real world requires designing molecules and materials with specific target properties. Reinforcement learning (RL) and preference optimization techniques [D. Lee and Cho (2024)] have emerged as powerful solutions for this challenge. For instance, Jang et al. (2025) combined supervised fine-tuning (SFT) and RL using proximal policy optimization (PPO) to generate diverse molecular sequences auto-regressively. This approach excels in exploring a broad chemical space, but incurs high computational costs due to its reliance on iterative, sequence-based generation. In contrast, Cavanagh et al. (2024) employed direct preference optimization (DPO) with SFT to fine-tune LLMs for molecular design, leveraging SMILES representations to optimize drug-like properties (e.g., hydrogen bond donors/acceptors and LogP). While DPO reduces computational overhead in comparison to PPO, it trades off molecular diversity, a key strength of the work by Jang et al. (2025), due to the inherent constraints of preference-based fine-tuning.
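Part of the reason DPO is cheaper than PPO is that its objective needs only sequence log-likelihoods under the trained policy and a frozen reference model, with no separate reward model or sampling loop. The sketch below evaluates the standard DPO loss for a single preference pair. The log-probability values and the choice of \(\beta\) are placeholders, and this is not the training code of Cavanagh et al. (2024).

```python
import math


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one preference pair (w = preferred sequence, l = rejected sequence).

    Inputs are sequence log-likelihoods under the policy (logp_*) and under the
    frozen reference model (ref_logp_*).
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return math.log1p(math.exp(-margin))  # equals -log(sigmoid(margin))


# Placeholder log-likelihoods for a preferred and a rejected SMILES string.
loss_at_reference = dpo_loss(-5.0, -5.0, -5.0, -5.0)  # policy identical to reference
loss_after_update = dpo_loss(-4.0, -6.0, -5.0, -5.0)  # policy now favors the preferred one
print(loss_at_reference, loss_after_update)
```

The loss falls as the policy raises the likelihood of the preferred molecule relative to the rejected one, measured against the reference model.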

Beyond these methods, energy ranking alignment (ERA) introduces a different optimization paradigm [Chennakesavalu et al. (2025)]. Unlike PPO or DPO, ERA uses gradient-based objectives to guide word-by-word generation with explicit reward functions, converging to a physics-inspired probability distribution that allows fine control over the generation process. In single-property optimization tasks, ERA successfully aligned molecular transformers to generate compounds with targeted chemical properties (QED, LogP, ring count, molar refractivity) while maintaining \(59\)–\(84\%\) chemical validity without regularization. For multi-objective optimization, it achieved precise control over property trade-offs using weighted energy functions.
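To make the idea of a physics-inspired target distribution concrete, the sketch below computes Boltzmann weights \(p_i \propto \exp(-\beta U_i)\) over a small candidate set, with a weighted sum of per-property energies as the combined energy. This illustrates the kind of distribution involved under simplifying assumptions; it is not the ERA training objective, and the energies, weights, and function names are hypothetical.

```python
import math


def combined_energy(per_property_energies, weights):
    """Weighted sum of per-property energies for one candidate."""
    return sum(w * u for w, u in zip(weights, per_property_energies))


def boltzmann_probs(energies, beta=1.0):
    """Normalized Boltzmann weights over a finite candidate set."""
    exponents = [-beta * u for u in energies]
    shift = max(exponents)  # subtract the maximum for numerical stability
    weights = [math.exp(e - shift) for e in exponents]
    total = sum(weights)
    return [w / total for w in weights]


# Three hypothetical candidates, each with two per-property energies (lower is better).
candidates = [(0.2, 1.0), (1.0, 0.1), (2.0, 2.0)]
favor_first = [combined_energy(c, (1.0, 0.1)) for c in candidates]
favor_second = [combined_energy(c, (0.1, 1.0)) for c in candidates]
print(boltzmann_probs(favor_first, beta=2.0))
print(boltzmann_probs(favor_second, beta=2.0))
```

Changing the weights shifts probability mass between candidates that excel at different properties, and increasing \(\beta\) concentrates it on the lowest-energy candidate.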

Calanzone, D’Oro, and Bacon (2025) also address the challenge of multi-objective molecular generation with Mol-MoE, a mixture of experts (MoE) framework (see Section 3.10.1 to learn more about MoE architectures). Mol-MoE dynamically combines property-specific expert models at test time, using preference-guided routers to steer generation toward drug-relevant molecular properties, which enables flexible steering across multiple objectives without re-training. Compared to alternatives like MORLHF [Zhou et al. (2024)], SFT with rewards-in-context, and simple model merging such as Rewarded Soups [Ramé et al. (2023)], Mol-MoE achieves superior performance in both property optimization and steerability, particularly in out-of-distribution scenarios where other methods struggle.

CrystalFormer-RL uses RL fine-tuning to optimize CrystalFormer [Cao et al. (2024)], a transformer-based crystal generator, with rewards from discriminative models (e.g., property predictors) [Cao and Wang (2025)]. RL improves stability (lower energy above the convex hull) and enables property-guided generation (e.g., targeting a high dielectric constant and a large band gap simultaneously). Here, RL fine-tuning is shown to outperform supervised fine-tuning, enhancing both novel material discovery and retrieval of high-performing candidates from the pre-training dataset.

6.2.1.5 Agents

Agent-based frameworks leveraging LLMs, explained in depth in Section 3.12, have emerged as approaches for autonomous molecular and materials generation, demonstrating capabilities that extend beyond simple prompting or fine-tuning by incorporating iterative feedback loops, tool integration, and human-artificial intelligence (AI) collaboration. The dZiner framework implements this approach for the inverse design of materials, where agents take initial SMILES strings with optimization task descriptions as input and generate validated candidate molecules by retrieving domain knowledge from the literature [Ansari et al. (2024)]. It also uses domain-expert surrogate models to evaluate the required property in the new molecule or material. These surrogate models are highly customizable to the desired property and give the user the option to train their own ML model or use an existing SOTA model. Ansari et al. (2024) demonstrated dZiner’s capabilities in generating surfactants for critical micelle concentration reduction, designing WDR5 inhibitors, and optimizing MOF organic linkers for \(\text{CO}_{2}\) adsorption. The CLADD framework adopts a RAG-enhanced multi-agent approach where specialized teams including “Planning”, “Knowledge Graph”, and “Molecular Understanding” collaborate to dynamically retrieve and integrate external biochemical knowledge for drug discovery tasks without requiring domain-specific fine-tuning [N. Lee et al. (2025)].

6.2.2 Validation

6.2.2.1 General validation

The most fundamental validation approaches use cheminformatics tools like RDKit to verify molecular validity. RDKit provides robust tools for validating molecules through its ability to parse and sanitize molecules from SMILES strings. If a step in the SMILES-to-structure conversion process fails, then the molecule is considered invalid. More sophisticated validation involves quantum mechanical calculations to compute molecular properties such as formation energies [Kingsbury et al. (2022)]. These computationally expensive operations provide deeper insights into whether generated structures are viable. Models are also evaluated for their ability to generate unique molecules by calculating the proportion of unique molecules in generated sets, often using molecular fingerprints or structural descriptors.
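The validity and uniqueness checks described above can be sketched in a few lines. This is an illustrative sketch rather than a reference implementation: the function names are hypothetical, uniqueness is assumed to be computed over the valid molecules only, and the RDKit-based canonicalizer is shown but not called so that the example runs without RDKit installed (a toy lookup table stands in for it).

```python
def validity_and_uniqueness(smiles_list, canonicalize):
    """Return (validity, uniqueness) for a list of generated SMILES strings.

    canonicalize: callable mapping a SMILES string to a canonical string,
    or to None when the string cannot be parsed.
    Assumption: uniqueness is measured among the valid molecules only.
    """
    canonical = [canonicalize(s) for s in smiles_list]
    valid = [c for c in canonical if c is not None]
    validity = len(valid) / len(smiles_list) if smiles_list else 0.0
    uniqueness = len(set(valid)) / len(valid) if valid else 0.0
    return validity, uniqueness


def rdkit_canonicalize(smiles):
    """Canonicalize with RDKit; returns None if parsing or sanitization fails."""
    from rdkit import Chem  # imported lazily; requires RDKit to be installed

    mol = Chem.MolFromSmiles(smiles)  # None when the SMILES is invalid
    return Chem.MolToSmiles(mol) if mol is not None else None


# Toy stand-in for a real canonicalizer (hypothetical lookup table).
TOY_TABLE = {"CCO": "CCO", "OCC": "CCO", "c1ccccc1": "c1ccccc1"}


def toy_canonicalize(smiles):
    return TOY_TABLE.get(smiles)


generated = ["CCO", "OCC", "c1ccccc1", "C1CC"]  # the last ring is never closed
validity, uniqueness = validity_and_uniqueness(generated, toy_canonicalize)
print(validity, uniqueness)
```

With RDKit available, passing `rdkit_canonicalize` instead of `toy_canonicalize` applies the parse-and-sanitize check to real molecules.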

The gold standard for validation is experimental synthesis, but significant gaps exist between computational generation and laboratory realization. As a preliminary check, metrics like Tanimoto similarity and Fréchet ChemNet distance [Preuer et al. (2018)] quantify structural resemblance, which can indicate synthetic feasibility when training data consists of known compounds. Retrosynthesis prediction algorithms attempt to bridge this gap by evaluating synthetic accessibility and proposing potential synthesis routes (see Section 6.3). However, these methods still face limitations in accurately predicting real-world synthesizability [Zunger (2019)].
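As an illustration of the similarity metrics mentioned above, the Tanimoto coefficient between two binary fingerprints is the size of their intersection divided by the size of their union. The sketch below assumes fingerprints are given as collections of "on" bit indices; the helper names and the toy fingerprints are hypothetical, and returning 0.0 for two empty fingerprints is a convention chosen here.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient |A ∩ B| / |A ∪ B| for two sets of on-bit indices."""
    a, b = set(fp_a), set(fp_b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / len(union)


def max_similarity_to_reference(candidate_fp, reference_fps):
    """Highest Tanimoto similarity between a candidate and any known compound."""
    return max((tanimoto(candidate_fp, ref) for ref in reference_fps), default=0.0)


# Toy fingerprints (hypothetical bit indices, not derived from real molecules).
candidate = {1, 4, 7, 9}
known_compounds = [{1, 4, 7, 8}, {2, 3}]
nearest = max_similarity_to_reference(candidate, known_compounds)
print(nearest)
```

A high nearest-neighbor similarity to known compounds is only a proxy for synthetic feasibility; it does not replace route planning or experiment.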

6.2.2.2 Conditional Generation Validation

Beyond establishing the general validity of generated molecules, evaluation methods can assess both their novelty relative to training data and their ability to meet specific design goals. For inverse design tasks, such as optimizing binding affinity or solubility, the de novo molecule generation benchmark GuacaMol differentiates between distribution-learning (e.g., generating diverse, valid molecules) and goal-directed optimization (e.g., rediscovering known drugs or meeting multi-objective constraints) [Brown et al. (2019)]. In the materials paradigm, frameworks such as MatBench Discovery evaluate analogous challenges such as stability, electronic properties, and synthesizability, but adapt metrics to periodic systems, such as energy above hull or band gap prediction accuracy [Riebesell et al. (2025)]. Riebesell et al. (2025) also introduced the “discovery acceleration factor”, which quantifies how effective a model is at finding stable structures relative to a random baseline.
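One way to read the "relative to a random baseline" description is as the ratio between a model's precision on stable structures and the prevalence of stable structures in the test set, since random selection would hit stable structures at the prevalence rate. The sketch below is written under that assumption; the function name and the toy labels are hypothetical.

```python
def discovery_acceleration_factor(predicted_stable, actually_stable):
    """Precision of the model's 'stable' picks divided by the base rate of stability.

    Assumption: the factor is defined as precision / prevalence, where prevalence
    is the hit rate a random selection would achieve on the same test set.
    """
    if len(predicted_stable) != len(actually_stable):
        raise ValueError("inputs must have the same length")
    n_total = len(actually_stable)
    n_selected = sum(predicted_stable)
    n_stable = sum(actually_stable)
    if n_total == 0 or n_selected == 0 or n_stable == 0:
        raise ValueError("need at least one selected and one stable structure")
    true_positives = sum(p and a for p, a in zip(predicted_stable, actually_stable))
    precision = true_positives / n_selected
    prevalence = n_stable / n_total
    return precision / prevalence


# Toy test set: 10 structures, 2 truly stable; the model flags 4, including both.
truth = [True, True] + [False] * 8
preds = [True, True, True, True] + [False] * 6
daf = discovery_acceleration_factor(preds, truth)
print(daf)
```

A value of 1 would correspond to a model that does no better than random selection under this definition.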

6.3 Retrosynthesis

The practical utility of GPMs for generating molecules and materials remains limited by a persistent gap in their synthetic feasibility. Early work by Schwaller et al. (2021) laid important groundwork by demonstrating how attention-based neural networks can learn meaningful representations of chemical reactions, enabling accurate classification and prediction of reaction outcomes. Their model, trained on millions of reactions from patent and literature data, showed that learned reaction embeddings were capable of capturing nuanced chemical relationships.

Recent efforts have built on this foundation by integrating synthesizability directly into molecular and materials generation pipelines that leverage both domain-specific tools and GPMs. For example, Sun et al. (2025) adapted Llama-3.1-8B and Llama-3.2-1B to predict retrosynthetic pathways and identify commercially available building blocks for experimentally validated SARS-CoV-2 Mpro inhibitors. Similarly, G. Liu et al. (2024) introduced a multimodal framework that combines reaction databases with chemical intuition encoded in LLMs, improving the prioritization of high-yield, low-cost synthetic routes.

More recent work has explored how fully fine-tuned LLMs can serve as comprehensive chemistry assistants for experimental guidance. Zhang et al. (2025) used a two-stage training process to first develop Chemma-SFT by fine-tuning Llama-2-7B on 1.28 million chemical reaction question-answer pairs covering reaction prediction, single-step retrosynthesis, and reaction condition generation tasks. In the second stage of training, they developed Chemma-RM using reinforcement learning from human feedback (RLHF) and applied it to optimize the experimental reaction conditions. Chemma successfully optimized an unreported Suzuki-Miyaura cross-coupling reaction within only 15 experimental runs.

Predictive retrosynthesis has also extended to the inorganic domain. Kim, Jung, and Schrier (2024) demonstrated that fine-tuned GPT-3.5 and GPT-4 can both predict the synthesizability of inorganic compounds from their chemical formulas and select appropriate precursors for synthesis, achieving performance comparable to specialized ML models with minimal development time and cost. In follow-up work, they extended this approach to structure-based predictions of inorganic crystal polymorphs, where LLMs provided human-readable explanations for their synthesizability assessments [Kim, Schrier, and Jung (2025)]. Notably, their structure-aware models correctly identified twelve hypothetical compounds as non-synthesizable despite their thermodynamic stability, perfectly matching experimental outcomes where synthesis attempts failed.

Beyond retrosynthetic prediction, LLMs have also been deployed as reasoning engines for autonomous design. Bran et al. (2024) developed ChemCrow, an LLM-based system that autonomously plans and executes the synthesis of novel compounds by integrating specialized tools like a retrosynthesis planner (see Section 5.5 to read more about this capability of ChemCrow and its limitations) and reaction predictors. This approach mirrors the iterative experimental design cycle employed by human chemists, but is equipped with the scalability of automation. Notably, systems like ChemCrow rely on high-quality reaction data to ground their reasoning in empirically viable chemistry, which, depending on the design space, could be a limitation.

6.4 LLMs as Optimizers

Figure 6.4: **Overview of the iterative optimization loop that mirrors the structure of the optimization section**. The blue boxes contain the different roles that LLMs play in the loop, which are described in the main text. The small boxes inside each component of the loop list references that detail the use of LLMs for that step. The example shown is about obtaining molecules with high `logP`.

Discovering novel compounds and reactions in chemistry and materials science has long relied on iterative trial-and-error processes rooted in existing domain knowledge [Taylor et al. (2023)]. While the methods explained in Section 6.3 accelerate this process, optimization methods help improve targets such as reaction conditions or binding affinity. However, trial-and-error approaches are slow and labor-intensive. Traditional data-driven methods aimed to address these limitations by combining predictive ML models with optimization frameworks such as Bayesian optimization (BO) or evolutionary algorithms (EAs). These frameworks balance exploration of uncharted regions in chemical space with exploitation of known high-performing regions [X. Li et al. (2024); Häse et al. (2021); Shields et al. (2021); Griffiths and Hernández-Lobato (2020); Rajabi-Kochi et al. (2025)].
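The exploration-exploitation balance can be made concrete with an acquisition function. The sketch below uses the upper confidence bound (UCB), one common choice in BO that is not specific to any of the works cited here; the candidate names and the surrogate means and uncertainties are hypothetical toy values.

```python
def upper_confidence_bound(mean, std, kappa=2.0):
    """UCB score: predicted value plus kappa times the predictive uncertainty."""
    return mean + kappa * std


def select_next(candidates, kappa=2.0):
    """Pick the candidate with the highest UCB score.

    candidates: dict mapping a candidate name to (predicted mean, predicted std),
    as a surrogate model would supply them (toy values below).
    """
    return max(candidates, key=lambda name: upper_confidence_bound(*candidates[name], kappa))


surrogate_predictions = {
    "A": (0.80, 0.05),  # well-characterized region, high predicted value
    "B": (0.60, 0.30),  # poorly explored region, uncertain prediction
    "C": (0.70, 0.10),
}
exploit_choice = select_next(surrogate_predictions, kappa=0.0)  # pure exploitation
explore_choice = select_next(surrogate_predictions, kappa=2.0)  # rewards uncertainty
print(exploit_choice, explore_choice)
```

Setting `kappa` to zero ignores uncertainty entirely, while larger values push the search toward regions the surrogate knows little about.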

Recent advances in LLMs have unlocked potential for addressing optimization challenges in chemistry and related domains [Fernando et al. (2023); Yang et al. (2023); Chen et al. (2024)]. A key strength of LLMs lies in their capacity to frame optimization tasks through natural language, which enhances knowledge incorporation, improves candidate comparisons, and increases interpretability. This aligns well with chemical problem-solving, where complex phenomena, such as reaction pathways or material behaviors, are often poorly captured by standard nomenclature; however, they can still be intuitively explained through natural language. Moreover, GPMs’ general capabilities provide flexibility beyond classical methods, which have to be trained from scratch if the optimization problem or any of its variables changes. By encoding domain-specific knowledge—including reaction rules, thermodynamic principles, and structure-property relationships—into structured prompts, LLMs can synergize expertise with their ability to navigate complex chemical optimization problems.

Current LLM applications in chemistry optimization vary in scope and methodology. Many studies integrate LLMs into BO frameworks, where models guide experimental design by predicting promising candidates [Ranković and Schwaller (2023)]. Others employ genetic algorithms (GAs) or hybrid strategies that combine LLM-generated hypotheses with computational screening [Cissé et al. (2025)].

6.4.1 LLMs as Surrogate Models

A prominent LLM-driven strategy positions these models as surrogate models within optimization loops. Surrogate models, typically implemented as Gaussian process regression (GPR), learn from prior data to approximate feature-outcome landscapes that are computationally expensive and time-consuming to evaluate, thereby guiding the acquisition of new candidates. LLMs offer major advantages in this role primarily through strong low-data performance. Their ICL capability enables task demonstration with minimal prompt examples while leveraging chemical knowledge from pre-training to generate accurate predictions. This allows GPMs to compensate for sparse experimental data effectively.

Ramos et al. (2023) demonstrated the viability of this paradigm through a simple yet effective framework that combines ICL using only one example in the prompt with a BO workflow. Their BO-ICL approach uses few-shot examples formatted as question-answer pairs, where the LLM generates candidate solutions conditioned on prior successful iterations. These candidates are ranked using an acquisition function, with top-\(k\) selections integrated into subsequent prompts to refine predictions iteratively. Remarkably, this method achieved high performance in optimizing catalytic reaction conditions, even matching the top-1 accuracies observed in experimental benchmarks. This emphasizes the potential of LLMs as accessible, ICL optimizers when coupled with well-designed prompts.
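The prompt-construction step of such a workflow can be sketched as follows. This is a hypothetical illustration, not the prompt used by Ramos et al. (2023): the question template, the choice of the `k` highest-valued observations as examples, and the toy procedures and yields are all assumptions made for this sketch.

```python
def build_icl_prompt(observations, query, k=3):
    """Format the k best observations as question-answer pairs, then the open query.

    observations: list of (procedure description, measured value) tuples.
    The template wording below is an assumption made for illustration.
    """
    best = sorted(observations, key=lambda obs: obs[1], reverse=True)[:k]
    lines = []
    for procedure, value in best:
        lines.append(f"Q: What is the yield of {procedure}?")
        lines.append(f"A: {value:.1f}")
    # The final answer is left open for the LLM to complete.
    lines.append(f"Q: What is the yield of {query}?")
    lines.append("A:")
    return "\n".join(lines)


# Toy experimental history (hypothetical procedures and yields).
history = [
    ("catalyst X at 300 C", 12.5),
    ("catalyst Y at 350 C", 40.2),
    ("catalyst Z at 400 C", 33.0),
]
prompt = build_icl_prompt(history, "catalyst Y at 400 C", k=2)
print(prompt)
```

After each round of evaluation, newly measured points would be appended to `history` and the prompt rebuilt, which is how the examples in the context are refined iteratively.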

To address limitations in base LLMs’ inherent chemical knowledge, particularly their grasp of specialized representations like SMILES or structure-property mappings, J. Yu et al. (2025) introduced a hybrid architecture augmenting pre-trained LLMs with task-specific embedding and prediction layers. These layers, fine-tuned on domain data, align latent representations of input-output pairs (denoted as <x> and <y> in prompts), enabling the model to map chemical structures and properties into a unified, interpretable space. Crucially, the added layers enhance chemical reasoning without sacrificing the flexibility of ICL, allowing the system to adapt to trends across iterations, similar to the approach of Ramos et al. (2023). Their evaluations on molecular optimization benchmarks, such as practical molecular optimization (PMO) [Gao et al. (2022)], revealed improvements over conventional methods, including BO-Gaussian process (GP), RL methods, and GA.

J. Yu et al. (2025) further highlighted the framework’s extensibility to diverse black-box optimization challenges beyond chemistry. This represents one of the most important advantages of using LLMs as orchestrators of the optimization process. The flexibility of natural language in this process enables the procedure to be applied to any optimization process. In contrast, classical methods are constrained to the specific task for which they are designed due to the need to train the surrogate model.

6.4.2 LLMs as Next Candidate Generators

Recent studies demonstrate the potential of LLMs to enhance EAs [Lu et al. (2025)] and BO [Amin, Raja, and Krishnapriyan (2025)] frameworks by leveraging their embedded chemical knowledge and ability to integrate prior information, thereby reducing computational effort while improving output quality. Within EAs, LLMs refine molecular candidates through mutations (modifying molecular substructures) or crossovers (combining parent molecules). In BO frameworks, they serve as acquisition functions, utilizing surrogate model predictions—both mean and uncertainty—to select optimal molecules or reaction conditions for evaluation.
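The role of an LLM as a mutation operator inside an EA can be sketched by making the proposal step a pluggable function. In the minimal example below, `toy_propose` is a hypothetical stand-in for an LLM call, the fitness function is an arbitrary toy objective, and the strings are treated as plain text rather than chemically meaningful SMILES.

```python
import random


def evolve(population, fitness, propose, generations=5, offspring_per_parent=2, seed=0):
    """Elitist evolutionary loop in which `propose` generates offspring.

    propose(parent, rng) -> child; in an LLM-guided EA this call would prompt
    a language model instead of applying a random edit.
    """
    rng = random.Random(seed)
    size = len(population)
    for _ in range(generations):
        offspring = [
            propose(parent, rng)
            for parent in population
            for _ in range(offspring_per_parent)
        ]
        # Parents stay in the pool, so the best fitness can never decrease.
        pool = set(population) | set(offspring)
        # Ties are broken by the string itself to keep the result deterministic.
        population = sorted(pool, key=lambda m: (fitness(m), m), reverse=True)[:size]
    return population


def toy_propose(parent, rng):
    """Stand-in for an LLM-guided mutation: append one random atom symbol."""
    return parent + rng.choice("CNO")


def carbon_count(candidate):
    """Toy objective: number of 'C' characters in the string."""
    return candidate.count("C")


start = ["CC", "CN", "CO"]
final = evolve(start, carbon_count, toy_propose, generations=4)
print(final)
```

Swapping `toy_propose` for a function that sends the parent and the task description to an LLM leaves the rest of the loop unchanged, which is the design idea behind replacing random mutations with model-guided ones.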

For molecule optimization, J. Yu et al. (2025) introduced MultiMol, a dual-LLM system where one model proposes candidates and the other supplies domain knowledge (see Section 6.4.3). By fine-tuning the “worker” LLM to recognize molecular scaffolds and target properties, and expanding the training pool to include a million-size pre-training dataset, they achieved hit rates exceeding \(90\%\). Similarly, H. Wang, Skreta, et al. (2025) developed MoLLEO, integrating an LLM into an EA to replace random mutations with LLM-guided modifications. Here, GPT-4 generated optimized offspring from parent molecules, significantly accelerating convergence to high fitness scores. Notably, while domain-specialized models (BioT5, MoleculeSTM) underperformed, the general-purpose GPT-4 excelled, a finding that underscores the context-dependent utility of LLMs.

In a related approach, Lu et al. (2025) showed that well-designed prompts that incorporate task-specific constraints, objectives, and few-shot examples enable general LLMs (Claude-Sonnet, o1-preview) to generate high-quality candidates without fine-tuning, outperforming both random selection and vanilla GAs in the design of functional transition metal complexes (TMCs).

6.4.3 LLMs as Prior Knowledge Sources

A key advantage of integrating LLMs into optimization frameworks is their ability to encode and deploy prior knowledge within the optimization loop. As illustrated in Figure 6.4, this knowledge can be directed into either the surrogate model or candidate generation module, significantly reducing the number of optimization steps required through high-quality guidance.

For example, J. Yu et al. (2025) deployed a “research” agent that leverages Google search and RDKit to verify and rank molecules generated by “worker” agents against target features and properties. Their results demonstrate substantial improvements when this filtering mechanism is applied.

Similarly, Cissé et al. (2025) introduced BORA, which contextualizes conventional black-box BO using an LLM. BORA maintains standard BO as the core driver but strategically activates the LLM when progress stalls. This leverages the model’s ICL capabilities to hypothesize promising search regions and propose new samples, regulated by a lightweight heuristic policy that manages costs and incorporates domain knowledge (or user input). Evaluations on synthetic benchmarks such as the catalyst optimization task for hydrogen generation show that BORA accelerates exploration, improves convergence, and outperforms existing LLM-BO hybrids.
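The idea of keeping standard BO as the main driver and consulting the LLM only when progress stalls can be illustrated with a simple patience rule. The sketch below is hypothetical and is not BORA's actual policy, which is not specified in this section: the function names, the `patience` and `min_gain` parameters, and the assumption that the objective is maximized are all choices made for illustration.

```python
def progress_stalled(best_so_far, patience=3, min_gain=1e-3):
    """True if the running best improved by less than min_gain over `patience` steps.

    best_so_far: running best objective value after each iteration (maximization).
    """
    if len(best_so_far) <= patience:
        return False  # not enough history to judge
    return best_so_far[-1] - best_so_far[-1 - patience] < min_gain


def choose_proposer(best_so_far, patience=3):
    """Use the (costly) LLM only when plain BO has stopped making progress."""
    return "llm" if progress_stalled(best_so_far, patience) else "bo"


improving = [0.1, 0.3, 0.5, 0.6]
plateaued = [0.1, 0.5, 0.5, 0.5, 0.5]
print(choose_proposer(improving), choose_proposer(plateaued))
```

A rule of this kind keeps LLM calls, and therefore cost, limited to the phases of the search where the conventional optimizer appears to be stuck.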

To enhance the task-specific knowledge of the LLM generating feedback, Zhang et al. (2025) fine-tuned a Llama-2-7B model using a multitask Q&A dataset. This dataset was created with instructions from GPT-4. The resulting model served as a human assistant or operated within an active learning loop, thereby accelerating the exploration of new reaction spaces (see Section 6.3). However, as the authors note, even these task-specialized LLMs produce suboptimal suggestions for optimization tasks. They remain prone to hallucination and cannot assist with unreported reactions, yet for most applications they still improve on purely classical methods.

6.4.4 How to Face Optimization Problems?

Published works explore different ways of using LLMs for optimization problems in chemistry, from simple approaches, such as just prompting the model with some initial random set of experimental candidates and iterating [Ramos et al. (2023)], to fine-tuning models in BO fashion [Ranković and Schwaller (2025)]. The most efficient starting point is to rely entirely on an ICL approach, which allows one to obtain a first signal rapidly. Such initial results make it possible to determine whether a more complex, computationally intensive approach is necessary or whether prompt engineering is reliable enough for the application. Fine-tuning can be used as a way to enhance the chemical knowledge of the LLMs and can lead to improvements in optimization tasks where the model requires such knowledge to choose or generate better candidates. Fine-tuning might not be a game-changer for other approaches that rely more on sampling methods [H. Wang, Guo, et al. (2025)].

While some initial works showed that LLMs trained specifically on chemistry perform better for optimization tasks [Kristiadi et al. (2024)], other works showed that a GPM such as GPT-4 combined with an EA outperformed all other models [H. Wang, Skreta, et al. (2025)]. Is it better to incorporate a general model or a chemistry LM into the optimization frameworks? We hypothesize that for models of the same size (in number of parameters) and similar training compute, measured in peta floating point operations (PFLOPs), a chemical LM (a specialized model) will consistently outperform general models. If the models differ significantly in size, the larger model will typically perform better.

Ahmad, Walid, Elana Simon, Seyone Chithrananda, Gabriel Grand, and Bharath Ramsundar. 2022. “ChemBERTa-2: Towards Chemical Foundation Models.” arXiv Preprint. https://doi.org/10.48550/arXiv.2209.01712.

Alampara, Nawaf, Santiago Miret, and Kevin Maik Jablonka. 2024. “MatText: Do language models need more than text & scale for materials modeling?” arXiv Preprint. https://doi.org/10.48550/arXiv.2406.17295.

Amin, Ishan, Sanjeev Raja, and Aditi Krishnapriyan. 2025. “Towards Fast, Specialized Machine Learning Force Fields: Distilling Foundation Models via Energy Hessians.” arXiv Preprint. https://doi.org/10.48550/arXiv.2501.09009.

Aneesh, Anagha, Nawaf Alampara, José A. Márquez, and Kevin Maik Jablonka. 2025. “Semantic Device Graphs for Perovskite Solar Cell Design.” The Thirteenth International Conference on Learning Representations Workshop on AI for Materials Science, ICLR-AI4MAT. https://openreview.net/forum?id=AGCClISEXL.

Ansari, Mehrad, Jeffrey Watchorn, Carla E. Brown, and Joseph S. Brown. 2024. “dZiner: Rational Inverse Design of Materials with AI Agents.” arXiv Preprint, October. https://doi.org/10.48550/arXiv.2410.03963.

Balaji, Suryanarayanan, Rishikesh Magar, Yayati Jadhav, and Amir Barati Farimani. 2023. “GPT-MolBERTa: GPT Molecular Features Language Model for Molecular Property Prediction.” arXiv Preprint arXiv:2310.03030, October. https://doi.org/10.48550/arXiv.2310.03030.

Batatia, Ilyes, D. Kovács, G. Simm, C. Ortner, and Gábor Csányi. 2022. “MACE: Higher Order Equivariant Message Passing Neural Networks for Fast and Accurate Force Fields.” Neural Information Processing Systems. https://doi.org/10.48550/arXiv.2206.07697.

Bhattacharya, Debjyoti, Harrison J. Cassady, Michael A. Hickner, and Wesley F. Reinhart. 2024. “Large Language Models as Molecular Design Engines.” Journal of Chemical Information and Modeling 64 (18): 7086–96. https://doi.org/10.1021/acs.jcim.4c01396.

Bran, Andres M., Sam Cox, Oliver Schilter, Carlo Baldassari, Andrew D. White, and Philippe Schwaller. 2024. “Augmenting Large Language Models with Chemistry Tools.” Nature Machine Intelligence 6 (5). https://doi.org/10.1038/s42256-024-00832-8.

Brown, Nathan, Marco Fiscato, Marwin H. S. Segler, and Alain C. Vaucher. 2019. “GuacaMol: Benchmarking Models for de Novo Molecular Design.” Journal of Chemical Information and Modeling 59 (3): 1096–1108. https://doi.org/10.1021/acs.jcim.8b00839.

Calanzone, Diego, Pierluca D’Oro, and Pierre-Luc Bacon. 2025. “Mol-MoE: Training Preference-Guided Routers for Molecule Generation.” arXiv Preprint arXiv:2502.05633, February. https://doi.org/10.48550/arXiv.2502.05633.

Cao, Zhendong, Xiaoshan Luo, Jian Lv, and Lei Wang. 2024. “Space Group Informed Transformer for Crystalline Materials Generation.” arXiv Preprint arXiv:2403.15734. https://doi.org/10.48550/arXiv.2403.15734.

Cao, Zhendong, and Lei Wang. 2025. “CrystalFormer-RL: Reinforcement Fine-Tuning for Materials Design.” arXiv Preprint arXiv:2504.02367, April. https://doi.org/10.48550/arXiv.2504.02367.

Cavanagh, Joseph M., Kunyang Sun, Andrew Gritsevskiy, Dorian Bagni, Thomas D. Bannister, and Teresa Head-Gordon. 2024. “SmileyLlama: Modifying Large Language Models for Directed Chemical Space Exploration.” arXiv Preprint arXiv:2409.02231. https://doi.org/10.48550/arXiv.2409.02231.

Chen, Lichang, Jiuhai Chen, Tom Goldstein, Heng Huang, and Tianyi Zhou. 2024. “InstructZero: Efficient Instruction Optimization for Black-Box Large Language Models.” Forty-First International Conference on Machine Learning, ICML 2024. https://openreview.net/forum?id=rADFNrIss3.

Cheng, Austin H, Andy Cai, Santiago Miret, Gustavo Malkomes, Mariano Phielipp, and Alán Aspuru-Guzik. 2023. “Group SELFIES: A Robust Fragment-Based Molecular String Representation.” Digital Discovery 2 (3): 748–58. https://doi.org/10.1039/D3DD00012E.

Chennakesavalu, Shriram, Frank Hu, Sebastian Ibarraran, and Grant M. Rotskoff. 2025. “Aligning Transformers with Continuous Feedback via Energy Rank Alignment.” arXiv Preprint arXiv:2405.12961, May. https://doi.org/10.48550/arXiv.2405.12961.

Chiang, Yuan, Elvis Hsieh, Chia-Hong Chou, and Janosh Riebesell. 2024. “LLaMP: Large Language Model Made Powerful for High-fidelity Materials Knowledge Retrieval and Distillation.” arXiv Preprint, October. https://doi.org/10.48550/arXiv.2401.17244.

Chithrananda, Seyone, Gabriel Grand, and Bharath Ramsundar. 2020. “ChemBERTa: Large-Scale Self-Supervised Pretraining for Molecular Property Prediction.” arXiv Preprint, October. https://doi.org/10.48550/arXiv.2010.09885.

Cissé, Abdoulatif, Xenophon Evangelopoulos, Vladimir V. Gusev, and Andrew I. Cooper. 2025. “Language-Based Bayesian Optimization Research Assistant (BORA).” arXiv Preprint arXiv:2501.16224. https://doi.org/10.48550/arXiv.2501.16224.

Dinh, Tuan, Yuchen Zeng, Ruisu Zhang, Ziqian Lin, Michael Gira, Shashank Rajput, Jy-yong Sohn, Dimitris Papailiopoulos, and Kangwook Lee. 2022. “LIFT: Language-Interfaced Fine-Tuning for Non-language Machine Learning Tasks.” Advances in Neural Information Processing Systems 35: 11763–84. https://doi.org/10.48550/arXiv.2206.06565.

Edwards, Carl, Tuan Lai, Kevin Ros, Garrett Honke, Kyunghyun Cho, and Heng Ji. 2022. “Translation Between Molecules and Natural Language.” arXiv Preprint. https://doi.org/10.48550/arXiv.2204.11817.

Edwards, Carl, ChengXiang Zhai, and Heng Ji. 2021. “Text2Mol: Cross-Modal Molecule Retrieval with Natural Language Queries.” Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, November, 595–607. https://doi.org/10.18653/v1/2021.emnlp-main.47.

Fernando, Chrisantha, Dylan Banarse, H. Michalewski, Simon Osindero, and Tim Rocktäschel. 2023. “Promptbreeder: Self-Referential Self-Improvement via Prompt Evolution.” International Conference on Machine Learning. https://doi.org/10.48550/arXiv.2309.16797.

Fifty, Christopher, Jure Leskovec, and Sebastian Thrun. 2023. “In-Context Learning for Few-Shot Molecular Property Prediction.” arXiv Preprint arXiv:2310.08863. https://doi.org/10.48550/arXiv.2310.08863.

Frey, Nathan C., Ryan Soklaski, Simon Axelrod, Siddharth Samsi, Rafael Gómez-Bombarelli, Connor W. Coley, and Vijay Gadepally. 2023. “Neural Scaling of Deep Chemical Models.” Nature Machine Intelligence 5 (11): 1297–1305. https://doi.org/10.1038/s42256-023-00740-3.

Ganose, Alex M, and Anubhav Jain. 2019. “Robocrystallographer: automated crystal structure text descriptions and analysis.” MRS Communications 9 (3): 874–81. https://doi.org/10.1557/mrc.2019.94.

Gao, Wenhao, Tianfan Fu, Jimeng Sun, and Connor W. Coley. 2022. “Sample Efficiency Matters: A Benchmark for Practical Molecular Optimization.” Neural Information Processing Systems. https://doi.org/10.48550/arXiv.2206.12411.

Griffiths, Ryan-Rhys, and José Miguel Hernández-Lobato. 2020. “Constrained Bayesian Optimization for Automatic Chemical Design Using Variational Autoencoders.” Chemical Science 11 (2): 577–86. https://doi.org/10.1039/c9sc04026a.

Gruver, Nate, Marc Anton Finzi, Dylan Sam, J. Zico Kolter, Ben Athiwaratkun, and Andrew Gordon Wilson. 2024. “The Promises and Pitfalls of Language Models for Structured Numerical Data.” OpenReview.net, October. https://openreview.net/forum?id=SZpygmv3G1.

Guo, Taicheng, Kehan Guo, B. Nan, Zhengwen Liang, Zhichun Guo, N. Chawla, O. Wiest, and Xiangliang Zhang. 2023. “What can Large Language Models do in chemistry? A comprehensive benchmark on eight tasks.” Neural Information Processing Systems. https://doi.org/10.48550/arXiv.2305.18365.

Hall, S. R., F. H. Allen, and I. D. Brown. 1991. “The Crystallographic Information File (CIF): A New Standard Archive File for Crystallography.” Acta Crystallographica Section A 47 (6): 655–85. https://doi.org/10.1107/S010876739101067X.

Häse, Florian, Matteo Aldeghi, Riley J. Hickman, Loïc M. Roch, and Alán Aspuru-Guzik. 2021. “Gryffin: An Algorithm for Bayesian Optimization of Categorical Variables Informed by Expert Knowledge.” Applied Physics Reviews 8 (3). https://doi.org/10.1063/5.0048164.

Huang, Bing, and O. Anatole von Lilienfeld. 2016. “Understanding Molecular Representations in Machine Learning: The Role of Uniqueness and Target Similarity.” arXiv Preprint arXiv:1608.06194. https://doi.org/10.48550/arXiv.1608.06194.

Isert, Clemens, Kenneth Atz, José Jiménez-Luna, and Gisbert Schneider. 2022. “QMugs, quantum mechanical properties of drug-like molecules.” Scientific Data 9 (1). https://doi.org/10.1038/s41597-022-01390-7.

Jablonka, Kevin Maik, Qianxiang Ai, Alexander Al-Feghali, Shruti Badhwar, Joshua D. Bocarsly, Andres M. Bran, Stefan Bringuier, et al. 2023. “14 examples of how LLMs can transform materials science and chemistry: a reflection on a large language model hackathon.” Digital Discovery 2 (5): 1233–50. https://doi.org/10.1039/d3dd00113j.

Jablonka, Kevin Maik, Philippe Schwaller, Andres Ortega-Guerrero, and Berend Smit. 2024. “Leveraging large language models for predictive chemistry.” Nature Machine Intelligence 6 (2): 161–69. https://doi.org/10.1038/s42256-023-00788-1.

Jang, Hyosoon, Yunhui Jang, Jaehyung Kim, and Sungsoo Ahn. 2025. “Can LLMs Generate Diverse Molecules? Towards Alignment with Structural Diversity.” arXiv Preprint arXiv:2410.03138, February. https://doi.org/10.48550/arXiv.2410.03138.

Kaur, Harveen, Flaviano Della Pia, Ilyes Batatia, Xavier R Advincula, Benjamin X Shi, Jinggang Lan, Gábor Csányi, Angelos Michaelides, and Venkat Kapil. 2025. “Data-Efficient Fine-Tuning of Foundational Models for First-Principles Quality Sublimation Enthalpies.” Faraday Discussions 256: 120–38. https://doi.org/10.1039/d4fd00107a.

Kim, Seongmin, Yousung Jung, and Joshua Schrier. 2024. “Large Language Models for Inorganic Synthesis Predictions.” Journal of the American Chemical Society.

Kim, Seongmin, Joshua Schrier, and Yousung Jung. 2025. “Explainable Synthesizability Prediction of Inorganic Crystal Polymorphs Using Large Language Models.” Angewandte Chemie International Edition. https://doi.org/10.1002/anie.202423950.

Kingsbury, Ryan S., Andrew S. Rosen, Ayush S. Gupta, Jason M. Munro, Shyue Ping Ong, Anubhav Jain, Shyam Dwaraknath, Matthew K. Horton, and Kristin A. Persson. 2022. “A Flexible and Scalable Scheme for Mixing Computed Formation Energies from Different Levels of Theory.” Npj Computational Materials. https://doi.org/10.1038/s41524-022-00881-w.

Krenn, Mario, Florian Häse, AkshatKumar Nigam, Pascal Friederich, and Alan Aspuru-Guzik. 2020. “Self-referencing embedded strings (SELFIES): A 100% robust molecular string representation.” Machine Learning: Science and Technology 1 (4): 045024. https://doi.org/10.1088/2632-2153/aba947.

Kristiadi, Agustinus, Felix Strieth-Kalthoff, Marta Skreta, Pascal Poupart, Alán Aspuru-Guzik, and Geoff Pleiss. 2024. “A Sober Look at LLMs for Material Discovery: Are They Actually Good for Bayesian Optimization over Molecules?” Forty-First International Conference on Machine Learning, ICML 2024. https://doi.org/10.48550/arXiv.2402.05015.

Krzyzanowski, Adrian, Stephen D. Pickett, and Peter Pogány. 2025. “Exploring BERT for Reaction Yield Prediction: Evaluating the Impact of Tokenization, Molecular Representation, and Pretraining Data Augmentation.” Journal of Chemical Information and Modeling 65 (9): 4381–4402. https://doi.org/10.1021/acs.jcim.5c00359.

Kuhn, Michael, Ivica Letunic, Lars Juhl Jensen, and Peer Bork. 2016. “The SIDER database of drugs and side effects.” Nucleic Acids Research 44 (D1): D1075–79. https://doi.org/10.1093/nar/gkv1075.

Lee, Daeseok, and Yongjun Cho. 2024. “Fine-Tuning Pocket-Conditioned 3D Molecule Generation via Reinforcement Learning.” The Twelfth International Conference on Learning Representations Workshop on Generative and Experimental Perspectives for Biomolecular Design, ICLR-GEM. https://openreview.net/forum?id=hlzRzr9ksu.

Lee, Namkyeong, Edward De Brouwer, Ehsan Hajiramezanali, Tommaso Biancalani, Chanyoung Park, and Gabriele Scalia. 2025. “RAG-Enhanced Collaborative LLM Agents for Drug Discovery.” arXiv Preprint arXiv:2502.17506. https://doi.org/10.48550/arXiv.2502.17506.

Li, Jiatong, Wei Liu, Zhihao Ding, Wenqi Fan, Yuqiang Li, and Qing Li. 2025. “Large Language Models Are in-Context Molecule Learners.” IEEE Transactions on Knowledge and Data Engineering 37 (7). https://doi.org/10.1109/TKDE.2025.3557697.

Li, Jiatong, Yunqing Liu, Wei Liu, Jingdi Le, Di Zhang, Wenqi Fan, Dongzhan Zhou, Yuqiang Li, and Qing Li. 2024. “MolReFlect: Towards In-Context Fine-Grained Alignments Between Molecules and Texts.” arXiv Preprint arXiv:2411.14721, November. https://doi.org/10.48550/arXiv.2411.14721.

Li, Xiaobo, Yu Che, Linjiang Chen, Tao Liu, Kewei Wang, Lunjie Liu, Haofan Yang, Edward O. Pyzer-Knapp, and Andrew I. Cooper. 2024. “Sequential Closed-Loop Bayesian Optimization as a Guide for Organic Molecular Metallophotocatalyst Formulation Discovery.” Nature Chemistry 16 (8): 1286–94. https://doi.org/10.1038/s41557-024-01546-5.

Li, Zhuoran, Xu Sun, Wanyu Lin, and Jiannong Cao. 2024. “Unveiling Molecular Secrets: An LLM-Augmented Linear Model for Explainable and Calibratable Molecular Property Prediction.” arXiv Preprint arXiv:2410.08829. https://doi.org/10.48550/arXiv.2410.08829.

Lin, Li-Chiang, Adam H. Berger, Richard L. Martin, Jihan Kim, Joseph A. Swisher, Kuldeep Jariwala, Chris H. Rycroft, et al. 2012. “In silico screening of carbon-capture materials.” Nature Materials 11 (7): 633–41. https://doi.org/10.1038/nmat3336.

Lin, Xuan, Long Chen, Yile Wang, Xiangxiang Zeng, and Philip S. Yu. 2025. “Property Enhanced Instruction Tuning for Multi-Task Molecule Generation with Large Language Models.” arXiv Preprint arXiv:2412.18084, May. https://doi.org/10.48550/arXiv.2412.18084.

Liu, Gang, Michael Sun, Wojciech Matusik, Meng Jiang, and Jie Chen. 2024. “Multimodal Large Language Models for Inverse Molecular Design with Retrosynthetic Planning.” arXiv Preprint arXiv:2410.04223, October. https://doi.org/10.48550/arXiv.2410.04223.

Liu, Hongxuan, Haoyu Yin, Zhiyao Luo, and Xiaonan Wang. 2025. “Integrating Chemistry Knowledge in Large Language Models via Prompt Engineering.” Synthetic and Systems Biotechnology 10 (1): 23–38. https://doi.org/10.1016/j.synbio.2024.07.004.

Liu, Yuyan, Sirui Ding, Sheng Zhou, Wenqi Fan, and Qiaoyu Tan. 2024. “MolecularGPT: Open Large Language Model (LLM) for Few-Shot Molecular Property Prediction.” arXiv Preprint arXiv:2406.12950, October. https://doi.org/10.48550/arXiv.2406.12950.

Liu, Zhiyuan, Sihang Li, Yanchen Luo, Hao Fei, Yixin Cao, Kenji Kawaguchi, Xiang Wang, and Tat-Seng Chua. 2023. “MolCA: Molecular Graph-Language Modeling with Cross-Modal Projector and Uni-Modal Adapter.” arXiv Preprint arXiv:2310.12798v4, October. https://doi.org/10.48550/arXiv.2310.12798.

Lu, Jieyu, Zhangde Song, Qiyuan Zhao, Yuanqi Du, Yirui Cao, Haojun Jia, and Chenru Duan. 2025. “Generative Design of Functional Metal Complexes Utilizing the Internal Knowledge and Reasoning Capability of Large Language Models.” Journal of the American Chemical Society, July. https://doi.org/10.1021/jacs.5c02097.

Mitchell, John B. O. 2017. “DLS-100 Solubility Dataset.” https://doi.org/10.17630/3A3A5ABC-8458-4924-8E6C-B804347605E8.

Mobley, David L., and J. Peter Guthrie. 2014. “FreeSolv: a database of experimental and calculated hydration free energies, with input files.” Journal of Computer-Aided Molecular Design 28 (7). https://doi.org/10.1007/s10822-014-9747-x.

Preuer, Kristina, Philipp Renz, Thomas Unterthiner, Sepp Hochreiter, and Günter Klambauer. 2018. “Fréchet ChemNet Distance: A Metric for Generative Models for Molecules in Drug Discovery.” Journal of Chemical Information and Modeling 58 (9): 1736–41. https://doi.org/10.1021/acs.jcim.8b00234.

Raffel, Colin, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2020. “Exploring the limits of transfer learning with a unified text-to-text transformer.” Journal of Machine Learning Research 21 (140): 1–67. https://www.jmlr.org/papers/v21/20-074.html.

Rajabi-Kochi, Mahyar, Negareh Mahboubi, Aseem Partap Singh Gill, and Seyed Mohamad Moosavi. 2025. “Adaptive Representation of Molecules and Materials in Bayesian Optimization.” Chemical Science 16 (13): 5464–74. https://doi.org/10.1039/d5sc00200a.

Ramé, Alexandre, Guillaume Couairon, Mustafa Shukor, Corentin Dancette, Jean-Baptiste Gaya, Laure Soulier, and Matthieu Cord. 2023. “Rewarded Soups: Towards Pareto-Optimal Alignment by Interpolating Weights Fine-Tuned on Diverse Rewards.” arXiv Preprint arXiv:2306.04488, October. https://doi.org/10.48550/arXiv.2306.04488.

Ramos, Mayk Caldas, Shane S. Michtavy, Marc D. Porosoff, and Andrew D. White. 2023. “Bayesian Optimization of Catalysis with in-Context Learning.” arXiv Preprint arXiv:2304.05341. https://doi.org/10.48550/arXiv.2304.05341.

Ranković, Bojana, and Philippe Schwaller. 2023. “BoChemian: Large Language Model Embeddings for Bayesian Optimization of Chemical Reactions.” NeurIPS 2023 Workshop on Adaptive Experimental Design and Active Learning in the Real World. https://openreview.net/forum?id=A1RVn1m3J3.

———. 2025. “GOLLuM: Gaussian Process Optimized LLMs - Reframing LLM Finetuning Through Bayesian Optimization.” arXiv Preprint arXiv:2504.06265. https://doi.org/10.48550/arXiv.2504.06265.

Richard, Ann M., Ruili Huang, Suramya Waidyanatha, Paul Shinn, Bradley J. Collins, Inthirany Thillainadarajah, Christopher M. Grulke, et al. 2021. “The Tox21 10K Compound Library: Collaborative Chemistry Advancing Toxicology.” Chemical Research in Toxicology 34 (2): 189–216. https://doi.org/10.1021/acs.chemrestox.0c00264.

Riebesell, Janosh, Rhys E. A. Goodall, Philipp Benner, Yuan Chiang, Bowen Deng, Gerbrand Ceder, Mark Asta, Alpha A. Lee, Anubhav Jain, and Kristin A. Persson. 2025. “A Framework to Evaluate Machine Learning Crystal Stability Predictions.” Nature Machine Intelligence. https://doi.org/10.1038/s42256-025-01055-1.

Rubungo, Andre Niyongabo, Craig Arnold, Barry P. Rand, and Adji Bousso Dieng. 2023. “LLM-Prop: Predicting Physical And Electronic Properties Of Crystalline Solids From Their Text Descriptions.” arXiv Preprint arXiv:2310.14029. https://doi.org/10.48550/arXiv.2310.14029.

Sakiyama, Hiroshi, Motohisa Fukuda, and Takashi Okuno. 2021. “Prediction of Blood-Brain Barrier Penetration (BBBP) Based on Molecular Descriptors of the Free-Form and In-Blood-Form Datasets.” Molecules 26 (24). https://doi.org/10.3390/molecules26247428.

Schwaller, Philippe, Daniel Probst, Alain C. Vaucher, Vishnu H. Nair, David Kreutter, Teodoro Laino, and Jean-Louis Reymond. 2021. “Mapping the Space of Chemical Reactions Using Attention-Based Neural Networks.” Nature Machine Intelligence 3 (2): 144–52. https://doi.org/10.1038/s42256-020-00284-w.

Shields, Benjamin J., Jason Stevens, Jun Li, Marvin Parasram, Farhan Damani, Jesus I. Martinez Alvarado, Jacob M. Janey, Ryan P. Adams, and Abigail G. Doyle. 2021. “Bayesian Reaction Optimization as a Tool for Chemical Synthesis.” Nature 590 (7844): 89–96. https://doi.org/10.1038/s41586-021-03213-y.

Shoghi, Nima, Adeesh Kolluru, John R. Kitchin, Zachary W. Ulissi, C. L. Zitnick, and Brandon M. Wood. 2023. “From Molecules to Materials: Pre-Training Large Generalizable Models for Atomic Property Prediction.” International Conference on Learning Representations. https://doi.org/10.48550/arXiv.2310.16802.

Soares, Eduardo, Emilio Vital Brazil, Victor Shirasuna, Dmitry Zubarev, Renato Cerqueira, and Kristin Schmidt. 2025. “A Mamba-Based Foundation Model for Materials.” npj Artificial Intelligence 1 (1): 1–8. https://doi.org/10.1038/s44387-025-00009-7.

Srinivas, Sakhinana Sagar, and Venkataramana Runkana. 2024a. “Crossing New Frontiers: Knowledge-Augmented Large Language Model Prompting for Zero-Shot Text-Based de Novo Molecule Design.” arXiv Preprint arXiv:2408.11866. https://doi.org/10.48550/arXiv.2408.11866.

———. 2024b. “Cross-Modal Learning for Chemistry Property Prediction: Large Language Models Meet Graph Machine Learning.” arXiv Preprint arXiv:2408.14964, August. https://doi.org/10.48550/arXiv.2408.14964.

Sriram, Anuroop, Benjamin Kurt Miller, Ricky T. Q. Chen, and Brandon M. Wood. 2024. “FlowLLM: Flow Matching for Material Generation with Large Language Models as Base Distributions.” arXiv Preprint arXiv:2410.23405, October. https://doi.org/10.48550/arXiv.2410.23405.

Sun, Kunyang, Dorian Bagni, Joseph M. Cavanagh, Yingze Wang, Jacob M. Sawyer, Andrew Gritsevskiy, Oufan Zhang, and Teresa Head-Gordon. 2025. “SynLlama: Generating Synthesizable Molecules and Their Analogs with Large Language Models.” arXiv Preprint arXiv:2503.12602, April. https://doi.org/10.48550/arXiv.2503.12602.

Sypetkowski, Maciej, Frederik Wenkel, Farimah Poursafaei, Nia Dickson, Karush Suri, Philip Fradkin, and Dominique Beaini. 2024. “On the Scalability of GNNs for Molecular Graphs.” Advances in Neural Information Processing Systems 37: 19870–906. https://doi.org/10.48550/arXiv.2404.11568.

Taylor, Connor J., Alexander Pomberger, Kobi C. Felton, Rachel Grainger, Magda Barecka, Thomas W. Chamberlain, Richard A. Bourne, Christopher N. Johnson, and Alexei A. Lapkin. 2023. “A Brief Introduction to Chemical Reaction Optimization.” Chemical Reviews 123 (6): 3089–3126. https://doi.org/10.1021/acs.chemrev.2c00798.

Van Herck, Joren, María Victoria Gil, Kevin Maik Jablonka, Alex Abrudan, Andy S. Anker, Mehrdad Asgari, Ben Blaiszik, et al. 2025. “Assessment of fine-tuned large language models for real-world chemistry and material science applications.” Chemical Science 16 (2): 670–84. https://doi.org/10.1039/D4SC04401K.

Wang, Anthony Yu-Tung, Steven K. Kauwe, Ryan J. Murdock, and Taylor D. Sparks. 2021. “Compositionally restricted attention-based network for materials property predictions.” npj Computational Materials 7 (1). https://doi.org/10.1038/s41524-021-00545-1.

Wang, Haorui, Jeff Guo, Lingkai Kong, Rampi Ramprasad, Philippe Schwaller, Yuanqi Du, and Chao Zhang. 2025. “LLM-Augmented Chemical Synthesis and Design Decision Programs.” arXiv Preprint arXiv:2505.07027. https://doi.org/10.48550/arXiv.2505.07027.

Wang, Haorui, Marta Skreta, Cher Tian Ser, Wenhao Gao, Lingkai Kong, Felix Strieth-Kalthoff, Chenru Duan, et al. 2025. “Efficient Evolutionary Search over Chemical Space with Large Language Models.” The Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. https://doi.org/10.48550/arXiv.2406.16976.

Weininger, David. 1988. “SMILES, a Chemical Language and Information System. 1. Introduction to Methodology and Encoding Rules.” Journal of Chemical Information and Computer Sciences 28 (1). https://doi.org/10.1021/ci00057a005.

Wu, Zhenqin, Bharath Ramsundar, Evan N. Feinberg, Joseph Gomes, Caleb Geniesse, Aneesh S. Pappu, Karl Leswing, and Vijay Pande. 2018. “MoleculeNet: a benchmark for molecular machine learning.” Chemical Science 9 (2): 513–30. https://doi.org/10.1039/c7sc02664a.

Yang, Chengrun, Xuezhi Wang, Yifeng Lu, Hanxiao Liu, Quoc V. Le, Denny Zhou, and Xinyun Chen. 2023. “Large Language Models as Optimizers.” arXiv Preprint arXiv:2309.03409. https://doi.org/10.48550/arXiv.2309.03409.

Yoshikai, Yasuhiro, Tadahaya Mizuno, Shumpei Nemoto, and Hiroyuki Kusuhara. 2024. “A Novel Molecule Generative Model of VAE Combined with Transformer for Unseen Structure Generation.” arXiv Preprint arXiv:2402.11950. https://doi.org/10.48550/arXiv.2402.11950.

Yu, Botao, Frazier N. Baker, Ziqi Chen, Xia Ning, and Huan Sun. 2024. “LlaSMol: Advancing Large Language Models for Chemistry with a Large-Scale, Comprehensive, High-Quality Instruction Tuning Dataset.” arXiv Preprint arXiv:2402.09391. https://doi.org/10.48550/arXiv.2402.09391.

Yu, Jiajun, Yizhen Zheng, Huan Yee Koh, Shirui Pan, Tianyue Wang, and Haishuai Wang. 2025. “Collaborative Expert LLMs Guided Multi-Objective Molecular Optimization.” arXiv Preprint arXiv:2503.03503. https://doi.org/10.48550/arXiv.2503.03503.

Zhang, Yu, Yang Han, Shuai Chen, Ruijie Yu, Xin Zhao, Xianbin Liu, Kaipeng Zeng, et al. 2025. “Large Language Models to Accelerate Organic Chemistry Synthesis.” Nature Machine Intelligence. https://doi.org/10.1038/s42256-025-01066-y.

Zheng, Yizhen, Huan Yee Koh, Jiaxin Ju, Anh T. N. Nguyen, Lauren T. May, Geoffrey I. Webb, and Shirui Pan. 2025. “Large language models for scientific discovery in molecular property prediction.” Nature Machine Intelligence 7 (3): 437–47. https://doi.org/10.1038/s42256-025-00994-z.

Zhou, Zhanhui, Jie Liu, Jing Shao, Xiangyu Yue, Chao Yang, Wanli Ouyang, and Yu Qiao. 2024. “Beyond One-Preference-Fits-All Alignment: Multi-Objective Direct Preference Optimization.” arXiv Preprint arXiv:2310.03708. https://doi.org/10.48550/arXiv.2310.03708.

Zhu, Huaisheng, Teng Xiao, and Vasant G. Honavar. 2024. “3M-Diffusion: Latent Multi-Modal Diffusion for Language-Guided Molecular Structure Generation.” arXiv Preprint arXiv:2403.07179, October. https://doi.org/10.48550/arXiv.2403.07179.

Zunger, Alex. 2019. “Beware of plausible predictions of fantasy materials.” Nature 566 (7745): 447–49. https://doi.org/10.1038/d41586-019-00676-y.