5  Applications

5.1 Automating the Scientific Workflow

Recent advances in general-purpose models (GPMs), particularly large language models (LLMs), have enabled initial demonstrations of fully autonomous artificial intelligence (AI) scientists [Schmidgall et al. (2025)]. We define these as LLM-powered architectures capable of executing end-to-end research workflows based solely on the final objectives, e.g., “Unexplained rise of antimicrobial resistance in Pseudomonas. Formulate hypotheses, design confirmatory in vitro assays, and suggest repurposing candidates for liver-fibrosis drugs”. Such systems navigate, partially or entirely, the components of the scientific process outlined in Figure 5.1 and detailed in the subsequent sections.

While there are plenty of demonstrations of such systems in machine learning (ML) and programming, scientific implementations remain less explored.

5.1.1 Coding and ML Applications of AI Scientists

Frameworks such as Co-Scientist [Gottweis et al. (2025)], and AI-Scientist [Yamada et al. (2025)] aim to automate the entire ML research pipeline. They typically use multi-agent architectures (described in detail in Section 3.12.2.1) where specialized agents manage distinct phases of the scientific method [Schmidgall and Moor (2025)]. Critical to these systems is self-reflection [Renze and Guven (2024)]—iterative evaluation and criticism of results within validation loops. However, comparative analyses reveal that LLM-based evaluations frequently overscore outputs relative to human assessments [Q. Huang et al. (2023); Chan et al. (2024); Starace et al. (2025)]. An alternative is to couple these systems with evolutionary algorithms. AlphaEvolve [Novikov et al. (n.d.)] couples an LLM with a genetic algorithm (GA) environment and discovered novel algorithms for matrix multiplication (a problem where the best known algorithms had remained unchanged for fifty years) and sorting.

5.1.3 Are these Systems Capable of Real Autonomous Research?

Although agents like Zochi [Intology.ai (n.d.)] achieved peer-reviewed publication in top-tier venues (Association for Computational Linguistics (ACL) 2025), their capacity for truly autonomous end-to-end research remains debatable [Son et al. (2025)]. Even when generating hypotheses that appear novel and impactful, their execution and reporting of these ideas, as demonstrated by Si, Hashimoto, and Yang (2025), yield results deemed less attractive than those produced by humans. Additionally, while autonomous AI systems can generate hypotheses, conduct experiments, and produce publication-ready manuscripts, their integration requires careful consideration (refer to Section 7.3 for further discussion of moral concerns surrounding these systems).

5.1.4 Limitations

Despite impressive demos, current “AI scientist” systems are still hindered by fundamental limitations. Their literature synthesis is often shallow and their novelty checks are weak, with audits showing that such tools can misclassify familiar ideas as “new” and generally struggle to ground claims in prior work [Beel, Kan, and Baumgart (2025)]. Execution is equally fraught; automated pipelines exhibit fragile performance and silent errors that self-review cannot catch [Luo, Kasirzadeh, and Shah (2025)], necessitating expert oversight and validation [Gottweis et al. (2025)]. In chemistry, beyond idea generation, the hardest problems are infrastructural: closing the loop in real labs still runs into orchestration and integration headaches, heterogeneous instrument application programming interfaces (APIs), messy data plumbing [Canty et al. (2025); Fehlis et al. (2025)], and safety and governance concerns [Leong et al. (2025)]. Finally, evaluation is immature (refer to Section 4.3.1.2 for a deeper discussion of these evaluation methods): benchmarks for autonomous agents remain narrow and inconsistent, making it hard to trust self-reported improvements or generalize beyond tightly curated demos [Yehudai et al. (2025)].

5.1.5 Open Challenges

  • Capability: Developing reliable long-horizon planning with verifiable execution traces [Starace et al. (2025)], evaluated on standardized, domain-grounded benchmarks [Yehudai et al. (2025)].

  • Evaluation Beyond LLM-as-Judge: Persistent biases and instability call for human-grounded protocols, adversarial test sets, and cross-lab replication [Thakur et al. (2024); Starace et al. (2025)].

  • Infrastructure: Evolving literature operations from generic retrieval-augmented generation (RAG) to auditable claim-evidence graphs, contradiction mining, and robust novelty checks.

  • Trust & Governance: Ensuring trustworthy autonomy through calibrated uncertainty quantification and clear authorship policies aligned with scholarly norms [ACS Publications (n.d.)].

Figure 5.1: Overview of the scientific process. The outer elements represent the typical scientific research process: from gathering information and generating hypotheses based on the observations, to executing experiments and analyzing the results. The terms that are in the center represent data-driven “shortcuts” that “accelerate” the typical scientific method. All stages represented in the figure are discussed in detail in the following sections.

True acceleration of chemical research and the ultimate goal of fully autonomous science require AI systems that can operate across the entire scientific workflow. The stages outlined in Figure 5.1—from hypothesis generation to experimental execution and final reporting—represent the core of this process. While GPMs, especially LLMs, show promise in individual components, for AI to evolve from a showcase into a driver of fully autonomous research, it must master the entire workflow, seamlessly navigating between each of these stages.

5.2 Existing GPMs for Chemical Science

The development of GPMs for chemical science represents a departure from traditional single-task approaches. Rather than fine-tuning pre-trained models for specific tasks such as property prediction or molecular generation, these chemistry-aware models are intentionally designed to perform a broad range of chemical tasks. This multitask learning unifies related chemical tasks, boosting data efficiency and enabling emergent capabilities through joint training across domains. Table 5.1 shows an overview of the existing GPMs in the chemical sciences.

Table 5.1: Non-comprehensive list of existing GPMs in chemical sciences. Most current models are trained on a mixture of general and domain-specific corpora. Models are sorted by the order they appear in the main text.
Model | Architecture | Training | Representative Tasks | Representative Systems
DARWIN 1.5 [Xie et al. (2025)] | decoder-only transformer (LLaMA) with 7B parameters | fine-tuning on question & answer (Q&A) pairs; multi-task fine-tuning on language-interfaced classification and regression tasks | materials property prediction from natural language prompts | inorganic materials
ChemLLM [D. Zhang et al. (2024)] | decoder-only transformer (InternLM2) with 7B parameters | instruction tuning on template-based Q&A data using supervised fine-tuning (SFT) or direct preference optimization (DPO) | multi-topic chemistry Q&A | organic molecules & reactions
nach0 [Livne et al. (2024)] | encoder-decoder transformer (T5) with 250M or 780M parameters | pre-trained using self-supervised learning (SSL) on scientific literature and simplified molecular input line entry system (SMILES) strings; instruction-tuned | natural language to SMILES; small-molecule property prediction | organic molecules & reactions
LLaMat [Mishra et al. (2024)] | decoder-only transformer (LLaMA) with 7B parameters | continued pre-training on materials literature and crystallographic data | materials information extraction; understanding and generating crystallographic information files (CIFs) | inorganic materials
ChemDFM [Z. Zhao et al. (2024)] | decoder-only transformer (LLaMA) with 13B parameters | pre-trained on 34B tokens from chemistry papers and textbooks; instruction-tuned with Q&A and SMILES | SMILES to text; molecule/reaction property prediction | organic molecules & reactions
ether0 [Narayanan et al. (2025)] | decoder-only transformer (Mistral-Small) with 24B parameters; mixture of experts (MoE) | SFT on reasoning traces, then reinforcement learning (RL) | molecular editing; one-step retrosynthesis; reaction prediction | organic molecules & reactions

5.2.1 Domain Pre-Training and Multitask Learning

DARWIN 1.5 used a multitask approach to fine-tune Llama-7B through a two-stage process [Xie et al. (2025)]. In the first step, the base model was fine-tuned on \(332k\) scientific Q&A pairs to establish foundational scientific reasoning. Subsequently, the model underwent multitask learning on \(22\) different regression and classification tasks based on experimental datasets. DARWIN 1.5’s core idea is the use of language-interfaced fine-tuning (LIFT) (see Section 3.11) on diverse materials tasks to induce cross-task transfer during training. However, in some cases the results revealed negative task interaction, i.e., some tasks showed diminished performance under multitask training.
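To make the LIFT idea concrete, the sketch below shows how a single tabular materials record could be serialized into a prompt/completion pair for supervised fine-tuning. The record fields, prompt wording, and example values are illustrative assumptions, not DARWIN 1.5’s actual templates.

```python
# Minimal sketch of language-interfaced fine-tuning (LIFT) data preparation.
# Field names and prompt wording are illustrative, not DARWIN 1.5's actual templates.

def lift_example(record: dict, task: str, target_key: str) -> dict:
    """Serialize one tabular record into a prompt/completion pair for SFT."""
    features = "; ".join(f"{k} = {v}" for k, v in record.items() if k != target_key)
    prompt = (
        f"Task: {task}\n"
        f"Input: {features}\n"
        f"Question: what is the {target_key}?"
    )
    completion = str(record[target_key])
    return {"prompt": prompt, "completion": completion}


if __name__ == "__main__":
    record = {"formula": "SrTiO3", "space_group": "Pm-3m", "band_gap_eV": 3.2}
    print(lift_example(record, task="band gap regression", target_key="band_gap_eV"))
```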

ChemLLM followed a similar approach to DARWIN 1.5: template-based instruction tuning on \(\sim 7M\) Q&A pairs from its ChemData dataset [D. Zhang et al. (2024)].

While the above examples focus on combining different in-domain tasks, nach0 coupled natural language with chemical data [Livne et al. (2024)], based on a unified encoder-decoder transformer architecture. The model was pre-trained using SSL on a combination of SMILES strings and natural language from scientific literature, and then instruction-tuned on chemistry and natural-language processing (NLP) tasks. This allows nach0 to translate between natural language and SMILES, in addition to tasks like Q&A, information extraction, and molecule/reaction generation.

LLaMat employed parameter-efficient fine-tuning (PEFT) to continue the pre-training on crystal structure data in CIF format, enabling the generation of thermodynamically stable structures [Mishra et al. (2024)].

ChemDFM scaled this concept significantly, implementing domain pre-training on over 34 billion tokens from chemical textbooks and research articles [Z. Zhao et al. (2024)]. Subsequent comprehensive instruction tuning familiarized the model with chemical notation and patterns, distinguishing it from more materials-focused approaches such as LLaMat through its broader chemical knowledge base.

5.2.1.1 Reasoning-Based Approaches

A recent development in chemical GPMs incorporates explicit reasoning capabilities. The ether0 model demonstrated this approach as a 24 billion-parameter reasoning model trained on over 640k chemically verifiable problems (e.g., through code) across 375 tasks, including single-step retrosynthesis, molecular editing, and reaction prediction [Narayanan et al. (2025)]. Unlike previous models, ether0’s training used a novel RL approach (see Section 3.7). First, DeepSeek-R1 was prompted to create long chain-of-thought (CoT) traces ending in a SMILES. These traces were used to fine-tune a pre-trained Mistral-Small-24B using SFT, resulting in a “base reasoner” model. This model underwent RL using group-relative policy optimization (GRPO) on several verifiable chemical tasks to create “specialist” models—one for each task. High-quality outputs from these specialist models were selected and used to fine-tune the “base reasoner” to create a “generalist” model, capable of reasoning about all the tasks. In the last step, the generalist model underwent another round of RL on the chemical tasks to further improve performance. This method shows that structured problem-solving approaches can significantly improve performance on complex chemical tasks without the need for massive domain-specific corpora.
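To illustrate what a chemically verifiable task can look like in code, the sketch below implements a minimal reward function that checks whether a completion ends in a valid SMILES matching a reference structure after canonicalization. The “Answer:” convention and the binary reward are our own assumptions, not ether0’s actual reward design.

```python
# Illustrative sketch of a code-verifiable reward for RL on chemical tasks,
# in the spirit of verifiable problems such as molecular editing (not ether0's actual reward).
from rdkit import Chem  # pip install rdkit


def extract_answer(completion: str) -> str:
    """Assume the model ends its reasoning with 'Answer: <SMILES>' (a convention imposed here)."""
    return completion.rsplit("Answer:", 1)[-1].strip()


def reward(completion: str, reference_smiles: str) -> float:
    """Return 1.0 if the proposed SMILES is valid and matches the reference after canonicalization."""
    mol = Chem.MolFromSmiles(extract_answer(completion))
    if mol is None:
        return 0.0  # invalid SMILES gets no reward
    ref = Chem.MolFromSmiles(reference_smiles)
    return float(Chem.MolToSmiles(mol) == Chem.MolToSmiles(ref))


if __name__ == "__main__":
    print(reward("... so the edited molecule is Answer: c1ccccc1O", "Oc1ccccc1"))  # prints 1.0
```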

These diverse approaches illustrate the evolving landscape of chemical GPMs. Still, most applications of GPMs focus on one specific task, and we review those in the following sections.

5.2.2 Limitations

Chemical GPMs are emerging, but we lack systematic understanding of how to build effective ones.

The tokenization question remains unresolved. Domain-specific tokenizers might improve data efficiency, but they prevent reuse of models trained on general text datasets. The tradeoff remains unclear.

A deeper challenge exists: we might lack sufficient high-quality data. Existing chemical datasets are much smaller than those used in domains such as mathematics (see Chapter 2). The literature reports only successes. Failed experiments, discarded results, and negative outcomes never appear in published datasets. More critically, chemical practitioners rely on tacit knowledge [Polanyi (2009)]—expertise acquired through experience that experts cannot fully articulate. This implicit understanding remains absent from existing datasets.

5.3 Knowledge Gathering

The rate of publishing keeps growing, and as a result, it is increasingly challenging to manually collect all relevant knowledge, potentially stymying scientific progress.[Schilling-Wilhelmi et al. (2025); Chu and Evans (2021)] Even though knowledge collection might seem like a simple task, it often involves multiple steps, visually described in Figure 5.2 A. Here, we focus on structured data extraction and question answering. Example queries for both sections are in Figure 5.2 B.

5.3.2 Structured Data Extraction

For many applications, it can be useful to collect data in a structured form, e.g., tables with fixed columns. Obtaining such a dataset by extracting data from the literature using LLMs is currently one of the most practical avenues for LLMs in the chemical sciences [Schilling-Wilhelmi et al. (2025)].

5.3.2.1 Data Extraction Using Prompting

For most applications, zero-shot prompting should be the starting point. Zero-shot prompting has been used to extract data about organic reactions[Rı́os-Garcı́a and Jablonka (2025); Vangala et al. (2024); Patiny and Godin (2023)], synthesis procedures for metal-organic frameworks[Zheng et al. (2023)], polymer data[Schilling-Wilhelmi and Jablonka (2024); S. Gupta et al. (2024)], or other materials data[Polak and Morgan (2024); Hira et al. (2024); Kumar, Kabra, and Cole (2025); Wu et al. (2025); S. Huang and Cole (2022)].
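As a minimal illustration of this zero-shot setup, the sketch below prompts an OpenAI-compatible chat model to return reaction data as JSON. The model name, schema fields, and prompt wording are placeholders rather than the setups used in the cited studies.

```python
# Minimal sketch of zero-shot structured data extraction with an OpenAI-compatible client.
# Model name and JSON schema are placeholders; real pipelines should validate and retry outputs.
import json
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

SCHEMA = {"reactants": ["SMILES or name"], "products": ["SMILES or name"],
          "solvent": "string or null", "temperature_C": "number or null",
          "yield_percent": "number or null"}


def extract_reaction(paragraph: str) -> dict:
    prompt = (
        "Extract the reaction described below into JSON following this schema exactly. "
        "Use null for missing fields and return only JSON.\n"
        f"Schema: {json.dumps(SCHEMA)}\n\nText: {paragraph}"
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},  # ask for parseable JSON
        temperature=0,
    )
    return json.loads(resp.choices[0].message.content)


if __name__ == "__main__":
    text = ("Benzaldehyde (1.0 mmol) and nitromethane (1.2 mmol) were stirred in ethanol "
            "at 25 °C for 12 h to give the nitroalkene in 87% yield.")
    print(extract_reaction(text))
```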

5.3.2.2 Fine-tuning Based Data Extraction

If a (commercial) model needs to be run very often, it can be more cost-efficient to fine-tune a smaller, open-source model compared to prompting a large model (see Section 3.10.4). In addition, models might lack specialized knowledge and might not follow certain style guides, which can be introduced with fine-tuning. Ai et al. (2024) fine-tuned the LLaMa-2-7B model to extract chemical reaction data from United States Patent and Trademark Office (USPTO) documents into a JavaScript object notation (JSON) format compatible with the schema of the Open Reaction Database (ORD) [Kearnes et al. (2021)], achieving an overall accuracy of more than \(90\%\). In a different approach, W. Zhang et al. (2024) fine-tuned a closed-source model, GPT-3.5-Turbo, to recognize and extract chemical entities from USPTO. Fine-tuning improved the performance of the base model on the same task by more than \(15\%\). Dagdelen et al. (2024) went beyond the aforementioned methods by using a human-in-the-loop data annotation process, in which humans corrected the outputs of an LLM extraction instead of annotating data from scratch.
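A minimal sketch of the data-preparation side of such fine-tuning is shown below: annotated (paragraph, record) pairs are serialized into chat-formatted JSONL that most SFT tooling accepts. The system prompt and the ORD-like keys are illustrative assumptions, not the schemas used in the cited works.

```python
# Sketch: turning annotated (paragraph, structured record) pairs into chat-style JSONL
# for supervised fine-tuning of a smaller open model. The ORD-like fields are illustrative.
import json

SYSTEM = "You extract chemical reactions as JSON with keys: inputs, conditions, outcomes."


def to_jsonl(pairs: list[tuple[str, dict]], path: str) -> None:
    with open(path, "w", encoding="utf-8") as fh:
        for paragraph, record in pairs:
            example = {
                "messages": [
                    {"role": "system", "content": SYSTEM},
                    {"role": "user", "content": paragraph},
                    {"role": "assistant", "content": json.dumps(record)},
                ]
            }
            fh.write(json.dumps(example) + "\n")


if __name__ == "__main__":
    pairs = [(
        "The aryl bromide was coupled with phenylboronic acid (Pd(PPh3)4, K2CO3, toluene, 110 °C) in 92% yield.",
        {"inputs": ["aryl bromide", "phenylboronic acid"],
         "conditions": {"catalyst": "Pd(PPh3)4", "base": "K2CO3", "solvent": "toluene", "temperature_C": 110},
         "outcomes": {"yield_percent": 92}},
    )]
    to_jsonl(pairs, "reactions_sft.jsonl")
```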

5.3.2.3 Agents for Data Extraction

Agents (Section 3.12) have shown their potential in data extraction, though to a limited extent.[K. Chen et al. (2024); Kang and Kim (2024)] For example, Ansari and Moosavi (2024) introduced Eunomia, an agent that autonomously extracts structured materials data from scientific literature without requiring fine-tuning. Their agent is an LLM with access to tools such as chemical databases (e.g., the Materials Project database) and research papers from various sources. The document search tool leverages an API-based embedding model to find the top-k most relevant passages based on semantic similarity. Other tools include a chain-of-verification tool, which independently queries the agent (to avoid bias) to generate verification questions and then produces a verified response.
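The document-search tool described above can be sketched as a small top-k semantic retrieval function. The embedding model name is a placeholder, and the code is a generic illustration rather than Eunomia’s implementation.

```python
# Sketch of an embedding-based document-search tool: return the top-k passages
# by cosine similarity to the query. The embedding model name is a placeholder.
import numpy as np
from openai import OpenAI

client = OpenAI()


def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])


def top_k_passages(query: str, passages: list[str], k: int = 5) -> list[str]:
    doc_vecs = embed(passages)
    q_vec = embed([query])[0]
    # cosine similarity between the query and every passage
    sims = doc_vecs @ q_vec / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q_vec))
    best = np.argsort(sims)[::-1][:k]
    return [passages[i] for i in best]
```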

While the authors claim this approach simplifies dataset creation for materials discovery, the evaluation is limited to a narrow set of materials science tasks (mostly focusing on metal-organic framework (MOF)s), indicating the need for the creation of agent evaluation tools.

5.3.3 Question Answering

Besides extracting information from documents in a structured format, LLMs can also be used to answer questions such as “Has X been tried before?” by synthesizing knowledge from a corpus of documents (and potentially automatically retrieving additional documents).

An example of such a system is PaperQA. This agentic system contains tools for search, evidence gathering, and question answering, as well as for traversing citation graphs, which are shown in Figure 5.2. The evidence-gathering tool collects the most relevant chunks of information via semantic search and performs LLM-based re-ranking of these chunks (i.e., the LLM changes the order of the chunks depending on what is needed to answer the query). Subsequently, only the top-\(n\) most relevant chunks are kept. To further ground the responses, citation traversal tools (e.g., Semantic Scholar [Kinney et al. (2023)]) are used. These leverage the citation graph as a means of discovering supplementary literature references. Ultimately, to address the user’s query, a question-answering tool (a specially prompted LLM) is employed. It first augments the query with all the collected information before providing a definitive answer. The knowledge aggregated by these systems could be used to generate new hypotheses or challenge existing ones.
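The evidence-gathering step can be illustrated with a short sketch in which an LLM scores each retrieved chunk for relevance and only the top-n chunks are kept. The prompt, the 0-to-10 scale, and the model name are our own choices, not PaperQA’s implementation.

```python
# Sketch of LLM-based re-ranking of retrieved chunks, loosely following the
# evidence-gathering step described above (prompt and 0-10 scale are illustrative).
from openai import OpenAI

client = OpenAI()


def score_chunk(question: str, chunk: str) -> float:
    prompt = (f"On a scale from 0 (irrelevant) to 10 (directly answers it), how relevant is the "
              f"following passage to the question?\nQuestion: {question}\nPassage: {chunk}\n"
              "Reply with a single number.")
    resp = client.chat.completions.create(model="gpt-4o-mini",
                                          messages=[{"role": "user", "content": prompt}],
                                          temperature=0)
    try:
        return float(resp.choices[0].message.content.strip())
    except ValueError:
        return 0.0  # unparseable replies are treated as irrelevant


def gather_evidence(question: str, chunks: list[str], n: int = 5) -> list[str]:
    ranked = sorted(chunks, key=lambda c: score_chunk(question, c), reverse=True)
    return ranked[:n]  # keep only the top-n chunks to ground the final answer
```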

5.3.4 Limitations

LLMs and LLM-based systems are valuable information-gathering tools, but they have critical limitations. Their knowledge becomes outdated immediately after training. Without additional tools, they cannot identify when retrieved sources have been retracted or corrected.

Effective knowledge gathering requires human-AI collaboration. Current systems struggle to ask relevant clarifying questions.[Choudhury et al. (2025)] Domain-specific benchmarks for evaluating this capability remain absent, hindering progress in developing truly interactive knowledge-gathering agents.

5.3.5 Open Challenges

Critical challenges remain unresolved:

  • Multimodal Understanding. Current models cannot reliably extract data from plots, tables, and figures.[Alampara, Schilling-Wilhelmi, et al. (2025); T. Gupta et al. (2022)] This severely limits data extraction because substantial chemical knowledge exists only in visual formats. Improved multimodal capabilities are essential for comprehensive literature mining.

  • Source Quality Assessment. Selecting high-quality sources that are also relevant to the scientific query remains an open challenge, with scholarly metrics alone being insufficient.

5.4 Hypothesis Generation

Coming up with new hypotheses represents a cornerstone of the scientific process [Rock (n.d.)]. Historically, hypotheses have emerged from systematic observation of natural phenomena, exemplified by Isaac Newton’s formulation of the law of universal gravitation [Newton (1999)], which was inspired by the seemingly mundane observation of a falling apple [Kosso (2017)].

In modern research, hypothesis generation increasingly relies on data-driven tools. For example, clinical research employs frameworks such as visual interactive analytic tool for filtering and summarizing large health data sets (VIADS) to derive testable hypotheses from well-curated datasets [Jing et al. (2022)]. Similarly, advances in LLMs are now being explored for their potential to automate and enhance idea generation across scientific domains [O’Neill et al. (2025)]. However, such approaches face significant challenges due to the inherently open-ended nature of scientific discovery [Stanley, Lehman, and Soros (2017)]. Open-ended domains, as discussed in Chapter 2, risk intractability, as an unbounded combinatorial space of potential variables, interactions, and experimental parameters complicates systematic exploration [Clune (2019)]. Moreover, the quantitative evaluation of the novelty and impact of generated hypotheses remains non-trivial. As Karl Popper argued, scientific discovery defies rigid logical frameworks [Popper (1959)], and objective metrics for “greatness” of ideas are elusive [Stanley and Lehman (2015)]. These challenges underscore the complexity of automating or systematizing the creative core of scientific inquiry.

5.4.1 Initial Sparks

Recent efforts in the ML community have sought to simulate the hypothesis formulation process [Gu and Krenn (2025); Arlt et al. (2024)], primarily leveraging multi-agent systems [Jansen et al. (2025); Kumbhar et al. (2025)]. In such frameworks, agents typically retrieve prior knowledge to contextualize previous related work, grounding hypothesis generation in existing literature [Naumov et al. (2025); Ghareeb et al. (2025); Gu and Krenn (2024)]. A key challenge, however, lies in evaluating the generated hypotheses. While some studies leverage LLMs to evaluate novelty or interestingness [J. Zhang et al. (2024)], recent work has introduced critic agents—specialized components designed to monitor and iteratively correct outputs from other agents—into multi-agent frameworks (see Section 3.12.2.1). For instance, Ghafarollahi and Buehler (2024) demonstrated how integrating such critics enables systematic hypothesis refinement through continuous feedback mechanisms.

However, the reliability of purely model-based evaluation remains contentious. Si, Yang, and Hashimoto (2025) argued that relying on an LLM to evaluate hypotheses lacks robustness, advocating instead for human assessment. This approach was adopted in their work, where human evaluators validated hypotheses produced by their system and judged the LLM-produced hypotheses to be more novel than those proposed by humans. Notably, Yamada et al. (2025) advanced the scope of such systems by automating the entire ML research process, from hypothesis generation to article writing. Their system’s outputs were submitted to workshops at the International Conference on Learning Representations (ICLR) 2025, with one contribution ultimately accepted. However, the advances made by such works are currently incremental rather than unveiling new, paradigm-shifting research (see Figure 5.3).

5.4.2 Chemistry-Focused Hypotheses

In chemistry and materials science, hypothesis generation requires domain intuition, mastery of specialized terminology, and the ability to reason through foundational concepts [Miret and Krishnan (2024)]. To address potential knowledge gaps in LLMs, Q. Wang et al. (2023) proposed a few-shot learning approach (see Section 3.11.1) for hypothesis generation and compared it with model fine-tuning for the same task. Their method strategically incorporates in-context examples to supplement domain knowledge while discouraging over-reliance on existing literature. For fine-tuning, they designed a loss function that penalizes possible biases—e.g., given the context “hierarchical tables challenge numerical reasoning”, the model would be penalized if it generated an overly generic prediction like “table analysis” instead of a task-specific one—when trained on such examples. Human evaluations of ablation studies revealed that GPT-4, augmented with a knowledge graph of prior research, outperformed fine-tuned models, generating hypotheses with greater technical specificity and refining them iteratively.

Complementing this work, Yang, Liu, Gao, Xie, et al. (2025) introduced the Moose-Chem framework to evaluate the novelty of LLM-generated hypotheses. To avoid data contamination, their benchmark exclusively uses papers published after the knowledge cutoff date of the evaluated model, GPT-4o. Ground-truth hypotheses were derived from articles in high-impact journals (e.g., Nature, Science) and validated by domain-specialized PhD researchers. By iteratively providing the model with context from prior studies, GPT-4o achieved coverage of over \(80\%\) of the evaluation set’s hypotheses while accessing only \(4\%\) of the retrieval corpus, demonstrating efficient synthesis of ideas presumably not present in its training corpus.

Figure 5.3: Overview of LLM-based hypothesis generation. Current methods are based on LLM-sampling approaches in which an LLM proposes new hypotheses. The generated hypotheses are evaluated in terms of novelty and impact, either by another LLM or by a human. Then, through experimentation, the hypotheses are transformed into results, which show that current LLMs, limited to their training corpus, cannot produce groundbreaking ideas and at best yield incremental work. This is shown metaphorically with the puzzle: the “pieces of chemical knowledge” based on the hypotheses produced by LLMs are already present in the “chemistry puzzle” and do not unveil new parts of it.

5.4.3 Are LLMs Actually Capable of Novel Hypothesis Generation?

Automatic hypothesis generation is often regarded as the Holy Grail of automating the scientific process [Coley, Eyke, and Jensen (2020)]. However, achieving this milestone remains challenging, as generating novel and impactful ideas requires questioning current scientific paradigms [Kuhn (1962)]—a skill typically refined through years of experience—which is currently impossible for most ML systems.

Current progress in ML illustrates these limitations [Kon et al. (2025); Gu and Krenn (2024)]. Although some studies claim success, such as AI-generated ideas being accepted at workshops of ML conferences via double-blind review [Zhou and Arel (2025)], these achievements are limited. First, accepted submissions often focus on coding tasks, one of the strongest domains for LLMs. Second, workshop acceptances are less competitive than main conferences, as they prioritize early-stage ideas over rigorously validated contributions. In chemistry, despite some works showing promise for these systems [Yang, Liu, Gao, Liu, et al. (2025)], LLMs struggle to propose innovative hypotheses [Si, Hashimoto, and Yang (2025)]. Their apparent success often hinges on extensive sampling and iterative refinement rather than genuine conceptual innovation.

As Kuhn (1962) argued, generating groundbreaking ideas demands challenging prevailing paradigms—a capability missing in current ML models (they are trained to make the existing paradigm more likely rather than to question their training data), as shown in Figure 5.3. Thus, while accidental discoveries can arise from non-programmed events (e.g., Fleming’s identification of penicillin [Fleming (1929); Fleming (n.d.)]), transformative scientific advances typically originate from deliberate critique of existing knowledge [Popper (1959); Lakatos (1970)]. In addition, breakthroughs often cannot be achieved by optimizing for a simple metric—as we often do not fully understand the problem and, hence, cannot design a suitable metric.[Stanley and Lehman (2015)] Despite some publications suggesting that AI scientists already exist, such claims are supported only by narrow evaluations that yield incremental progress [Novikov et al. (n.d.)], not paradigm-shifting insights. For AI to evolve from research assistants into autonomous scientists, it must demonstrate efficacy in addressing societally consequential challenges, such as solving complex, open-ended problems at scale (e.g., the “millennium” math problems [Carlson, Jaffe, and Wiles (2006)]).

Finally, ethical considerations become critical as hypothesis generation grows more data-driven and automated. Adherence to legal and ethical standards must guide these efforts (see Section 7.2)[The Danish National Committee on Health Research Ethics (n.d.)].

5.4.4 Limitations

Current LLM-driven systems lack the kind of creativity needed for paradigm-shifting hypotheses, tending to rearrange training data and retrieved content rather than propose genuinely new mechanisms. As a result, the evaluation of such outputs is also fragile, because using proxy metrics for “novelty” or “impact” that poorly track real scientific value can be deceiving. Finally, the problem’s open-ended nature (Chapter 4) makes systematic benchmarking ill-posed.

5.4.5 Open Challenges

  • Scalable Evaluation: The open-ended assessment of a hypothesis’s potential impact remains a core challenge, as current methods are difficult to scale, costly, and inefficient due to a heavy reliance on human input [Nie et al. (2025)].

  • The Integration Gap: A critical disconnect persists between hypothesis generation and automated experimental validation, especially in fields like chemistry.

  • The Paradigm Limitation: The underlying operational constraints of current modeling approaches inherently favor incremental progress over transformative breakthroughs.

5.5 Experiment Planning

Before a human or robot can execute any experiments, a plan must be created. Planning can be formalized as the process of decomposing a high-level task into a structured sequence of actionable steps aimed at achieving a specific goal. The term planning is often confused with scheduling and RL, which are closely related but distinct concepts. Scheduling is a more specific process focused on the timing and sequence of tasks. It ensures that resources are efficiently allocated, experiments are conducted in an optimal order, and constraints (such as lab availability, time, and equipment) are respected.[Kambhampati et al. (n.d.)] RL is about adapting and improving plans over time based on ongoing results.[P. Chen et al. (2022)]

5.5.1 Conventional Planning

Early chemical planning systems, such as logic and heuristics applied to synthetic analysis (LHASA) [Corey, Cramer III, and Howe (1972)] and Chematica[Grzybowski et al. (2018)], relied on simple rules and templates. In particular, Chematica used heuristic-guided graph search with rule-based transforms and scoring functions to prune and prioritize routes. Modern systems, like ASKCOS[Tu et al. (2025)], explicitly use search algorithms such as breadth-first search (BFS) or Monte Carlo tree search (MCTS)[Segler, Preuß, and Waller (2017)] to explore the combinatorially large space. But these planning search algorithms remain inefficient for long-horizon or complex planning tasks. [X. Liu et al. (2024); D. Zhao, Tu, and Xu (2024)]

5.5.2 LLMs to Decompose Problems into Plans

GPMs, in particular LLMs, can potentially assist in planning with two modes of thinking. Deliberate thinking can be used to score potential options or to decompose problems into plans. Intuitive thinking can be used to efficiently prune search spaces. These two modes align with psychological frameworks known as system-1 (intuitive) and system-2 (deliberate) thinking. [Kahneman (2011)] In system-1 thinking, LLMs support rapid decision-making by leveraging heuristics and pattern recognition to quickly narrow down options. In contrast, system-2 thinking represents a slower, more analytical process, in which LLMs solve complex tasks by explicitly generating step-by-step reasoning. [Ji et al. (2025)]

Figure 5.4 shows how an LLM applies this deliberate, system-2-style reasoning to decompose a chemical problem into a sequence of planned steps. A variety of strategies have been proposed to improve the reasoning capabilities of LLMs during inference. Methods such as CoT and least-to-most prompting guide models to decompose problems into intermediate steps. However, their effectiveness in planning is limited by error accumulation and linear thinking patterns.[Stechly, Valmeekam, and Kambhampati (2024)] To address these limitations, recent test-time strategies such as repeat sampling and tree search have been proposed. Repeated sampling allows the model to generate multiple candidate reasoning paths, encouraging diversity in thought and increasing the chances of discovering effective subgoal decompositions. [E. Wang et al. (2024)] Meanwhile, tree search methods like tree-of-thought (ToT) and reasoning via planning (RAP) treat reasoning as a structured search, using algorithms like MCTS to explore and evaluate multiple solution paths, facilitating more global and strategic decision-making. [Hao et al. (2023)]
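A minimal sketch of repeated sampling for planning is shown below: several candidate decompositions are drawn at non-zero temperature and the highest-scoring one is kept. The prompts, model name, and the LLM-based scorer (which, given the evaluation biases discussed earlier, could be replaced by an external verifier) are illustrative assumptions.

```python
# Sketch of repeated sampling for planning: draw several candidate decompositions
# and keep the highest-scoring one. Prompts, model names, and the judge are illustrative.
from openai import OpenAI

client = OpenAI()


def propose_plan(task: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user",
                   "content": f"Decompose the following chemistry task into numbered steps:\n{task}"}],
        temperature=0.9,  # non-zero temperature to diversify candidate plans
    )
    return resp.choices[0].message.content


def score_plan(task: str, plan: str) -> float:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user",
                   "content": f"Rate from 0 to 10 how safe, complete, and executable this plan is "
                              f"for the task '{task}'. Reply with a single number.\n{plan}"}],
        temperature=0,
    )
    try:
        return float(resp.choices[0].message.content.strip())
    except ValueError:
        return 0.0


def best_of_n(task: str, n: int = 5) -> str:
    candidates = [propose_plan(task) for _ in range(n)]
    return max(candidates, key=lambda p: score_plan(task, p))
```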

LLMs have also been applied to generate structured procedures from limited observations. For example, in quantum physics, a model was trained to infer reusable experimental templates from measurement data, producing Python code that generalized across system sizes. [Arlt et al. (2024)]

Figure 5.4: An example of using an LLM for chemical planning to decompose a problem. The LLM decomposes a chemical problem into sequential steps with detailed procedures. We manually evaluate the plan step by step and highlight the problematic steps in text boxes, each linked to the corresponding reason. The plan was generated by ChatGPT-5.

5.5.3 Pruning of Search Spaces

Pruning refers to the process of eliminating unlikely or suboptimal options during the search to reduce the computational burden. Classical planners employ heuristics, value functions, or logical filters to perform pruning[Bonet and Geffner (2012)].

Figure 5.5 illustrates how LLMs can support experimental planning by pruning options: emulating an expert chemist’s intuition, they discard synthetic routes that appear unnecessarily long, inefficient, or mechanistically implausible.

To further enhance planning efficacy, LLMs can be augmented with external tools that estimate the feasibility or performance of candidate plans, enabling targeted pruning of the search space before costly execution.
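As a schematic example of such tool-assisted pruning, the sketch below filters a pool of candidate routes by length and by a predicted overall yield. `estimate_route_yield` is a hypothetical stand-in for any external feasibility or yield-prediction tool.

```python
# Sketch of pruning a pool of candidate routes with an external feasibility tool
# before execution. `estimate_route_yield` is a hypothetical stand-in for any
# retrosynthesis scorer or yield-prediction model.
from typing import Callable

Route = list[str]  # e.g., a list of reaction SMILES, one per step


def prune_routes(routes: list[Route],
                 estimate_route_yield: Callable[[Route], float],
                 min_overall_yield: float = 0.2,
                 max_steps: int = 8) -> list[Route]:
    kept = []
    for route in routes:
        if len(route) > max_steps:
            continue  # heuristic prune: unnecessarily long pathways
        if estimate_route_yield(route) < min_overall_yield:
            continue  # tool-based prune: predicted overall yield too low
        kept.append(route)
    return kept
```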

Beyond external tools, LLMs can self-correct by pruning flawed reasoning steps to produce more coherent plans. At a higher level of oversight, human-in-the-loop frameworks such as ORGANA[Darvish et al. (2025)] incorporate expert chemist feedback to refine goals, resolve ambiguities, and eliminate invalid directions. Note 5.1 presents examples illustrating how planning extends to real laboratory practice.

Figure 5.5: GPM-guided retrosynthesis route planning and pruning. GPMs can systematically evaluate and prune retrosynthetic routes using multiple reasoning capabilities to discriminate between viable and problematic approaches. The partially overlapping arrows at the start of each route indicate multiple steps. Route A: This route was pruned by heuristic reasoning due to the unfavorable aromatic core construction. Route B: This route was selected as it successfully passes all GPM planning checks, demonstrating optimal synthetic feasibility. Route C: This pathway was pruned by external tools due to the poor regio-selectivity of the oxidation step. Route D: This route was pruned based on learned intuition, as it represents an inefficient multistep pathway; the route could just start with phenol instead of synthesizing it.

5.5.4 Limitations

Figure 5.4 demonstrates how LLMs can generate sophisticated-looking plans that conceal critical flaws, making complex, long-horizon chemical planning tasks particularly difficult to execute reliably [Cao et al. (2025)]. First, LLMs often produce outputs that appear chemically plausible but are invalid or unsafe, owing to their optimization for linguistic plausibility rather than chemical correctness and their lack of mechanistic understanding [Andrés M. Bran and Schwaller (2024); Evans et al. (2021)]. Second, errors can also propagate from external tools like retrosynthesis planners, whose data and algorithmic shortcomings constrain reliability [Z. Li et al. (2025)]. Finally, a broader limitation is the lack of grounding in experimental feedback, which creates persistent gaps between theoretical planning and practical feasibility.

5.5.5 Open Challenges

  • Reasoning Complexity Beyond Knowledge Retrieval: Complex chemistry problems require long, tightly interconnected chains of reasoning, where minor errors can cascade, and demand an understanding of dynamic interactions such as temperature effects on molecular behavior. Current LLMs lack effective structures to guide such domain-specific reasoning. [Ouyang et al. (2023); Tang et al. (2025)]

  • Evaluation and Feedback Bottleneck: Current evaluation methods are often performed manually or indirectly, either relying on expert review as in ChemCrow [Andres M. Bran et al. (2024)] or on pseudocode-based comparisons as in BioPlanner.[O’Donoghue et al. (2023)] Integrating feedback remains an open direction for improving the practical feasibility of generated plans.

5.6 Experiment Execution

Once an experimental plan is available, whether from a human scientist’s idea or an AI model, the next step is to execute it. Regardless of its source, the plan must be translated into concrete, low-level actions for execution.

It is worth noting that, despite their methodological differences, executing experiments in silico (running simulations or code) and in vitro are not fundamentally different—both follow an essentially identical workflow: Plan \(\rightarrow\) Instructions \(\rightarrow\) Execution \(\rightarrow\) Analysis.

The execution of in silico experiments can be reduced to two essential steps: preparing input files and running the computational code; GPMs can be used in both steps.[Z. Liu, Chai, and Li (2025); Mendible-Barreto et al. (2025); Zou et al. (2025); Campbell et al. (2025)] Jacobs and Pollice (2025) found that using a combination of fine-tuning, CoT and RAG (see Section 3.11) can improve the performance of LLMs in generating executable input files for the quantum chemistry software ORCA[Neese (2022)], while Gadde et al. (2025) created AutosolvateWeb, an LLM-based platform that assists users in preparing input files for quantum mechanics/molecular mechanics (QM/MM) simulations of explicitly solvated molecules and running them on a remote computer. Examples of LLM-based autonomous agents (see Section 3.12) capable of performing the entire computational workflow (i.e., preparing inputs, executing the code, and analyzing the results) are MDCrow [Campbell et al. (2025)] (for molecular dynamics) and El Agente Q [Zou et al. (2025)] (for quantum chemistry).
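A minimal sketch of these two steps, assuming an OpenAI-compatible client and an `orca` binary on the PATH, is shown below. The prompt and the invocation details are illustrative, and a real workflow should validate the generated input before running it.

```python
# Sketch of the two in silico steps: an LLM drafts an input file, which is then written
# to disk and handed to the executable. Prompt wording and the `orca` invocation are assumptions;
# generated inputs should be validated before execution.
import subprocess
from pathlib import Path
from openai import OpenAI

client = OpenAI()


def draft_input(task: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user",
                   "content": "Write an ORCA input file for the following calculation. "
                              "Return only the file contents.\n" + task}],
        temperature=0,
    )
    return resp.choices[0].message.content


def run_job(task: str, workdir: str = "calc") -> str:
    Path(workdir).mkdir(exist_ok=True)
    inp = Path(workdir) / "job.inp"
    inp.write_text(draft_input(task))
    # assumes the `orca` binary is on PATH; stdout is captured as the output file
    result = subprocess.run(["orca", str(inp)], capture_output=True, text=True)
    (Path(workdir) / "job.out").write_text(result.stdout)
    return result.stdout
```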

Emerging examples show GPMs assisting in in vitro experiment automation. Programming language paradigms—compiled vs. interpreted (Figure 5.6 A)—provide a useful analogy for understanding different automation approaches.

Compiled languages (C++, Fortran) convert entire programs to machine code before execution. Interpreted languages (Python, JavaScript) translate and execute instructions line-by-line at runtime. The tradeoff is that compiled languages offer higher performance and early error detection but require separate compilation steps. Interpreted languages enable rapid development and on-the-fly modification, but run slower and catch errors only during execution.

Similarly, experiment automation follows two paradigms (Figure 5.6 B): “compiled automation” translates entire protocols—by human or GPM—into low-level instructions before execution. “Interpreted automation” uses the GPM as runtime interpreter, executing protocols step-by-step.

Figure 5.6: Programming languages vs. lab automation. A) programming paradigms: In compiled languages, the entire source code is translated ahead of time to machine code by the compiler. This stand-alone code is then given to the operating system (OS), which is responsible for scheduling and distributing tasks to the hardware. In interpreted languages, the interpreter reads and translates each line of the source code to machine code and hands it to the OS for execution. B) automation paradigms: In the compiled approach, a GPM formalizes the protocol, and a compiler, such as the Chempiler [Steiner et al. (2019)], translates the formalized protocol into hardware-specific low-level steps, which are then executed by the controller, a central hub tasked with scheduling and distributing commands to chemical hardware. In the interpreted approach, a GPM, acting as the interpreter, first breaks down the protocol into specific steps, then sends them (via an API) for execution one by one. The strength of interpreted systems is dynamic feedback: after the execution of each step, the GPM receives a signal (e.g., data, errors), which can influence its behavior for the next steps.

5.6.1 Compiled Automation

In “compiled automation”, protocols are formalized in high-level languages or domain-specific language (DSL)s. A chemical compiler (“chempiler”[Mehr et al. (2020)]) converts these into low-level hardware instructions, which robotic instruments then execute (Figure 5.6 B).

While Python scripts serve as the de facto protocol language, specialized DSLs provide more structured representations.[Z. Wang et al. (2022); Ananthanarayanan and Thies (2010); Strateos (n.d.); Park et al. (2023)] For example, χDL[Mehr et al. (2020); Group (n.d.)] describes protocols using abstract commands (Add, Stir, Filter) and chemical objects (Reagents, Vessels). The Chempiler translates χDL scripts into platform-specific instructions based on the physical laboratory configuration.

Writing protocols in formal languages requires coding expertise. Here, GPMs translate natural-language protocols into machine-readable formats.[Sardiña, García-González, and Luaces (2024); Jiang et al. (2024); Conrad et al. (2025); Inagaki et al. (2023); Vaucher et al. (2020)] Pagel, Jirásek, and Cronin (2024) introduced a multi-agent workflow (based on GPT-4) that can convert unstructured chemistry papers into executable code. The first agent extracts all synthesis-relevant text, including supporting information; a procedure agent then sanitizes the data and tries to fill the gaps from chemical databases (using RAG); another agent translates procedures into χDL and simulates them on virtual hardware; finally, a critique agent cross-checks the translation and fixes errors.

The example above shows one of the strengths of the compiled approach: it allows for pre-validation. The protocol can be simulated or checked for any errors before running on the actual hardware, ensuring safety. Another example of LLM-based validators for chemistry protocols is CLAIRify.[Yoshikawa et al. (2023)] Leveraging an iterative prompting strategy, it uses GPT-3.5 to first translate the natural-language protocol into a χDL script, then automatically verifies its syntax and structure, identifies any errors, appends those errors to the prompt, and prompts the LLM again—iterating this process until a valid χDL script is produced.
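The iterative prompting pattern can be summarized in a short loop: translate, validate, feed the errors back, and repeat until the script passes. In the sketch below, `validate_xdl` is a hypothetical stand-in for a χDL syntax and hardware checker, and the prompts and model name are illustrative rather than CLAIRify’s actual implementation.

```python
# Sketch of the iterative translate-and-validate loop described above. `validate_xdl`
# stands in for a real χDL syntax/hardware checker; prompts and model name are illustrative.
from typing import Callable
from openai import OpenAI

client = OpenAI()


def translate_with_feedback(protocol: str,
                            validate_xdl: Callable[[str], list[str]],
                            max_rounds: int = 5) -> str:
    prompt = f"Translate this procedure into a χDL script:\n{protocol}"
    for _ in range(max_rounds):
        resp = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
            temperature=0,
        )
        script = resp.choices[0].message.content
        errors = validate_xdl(script)
        if not errors:
            return script  # valid script: stop iterating
        # append the validator's errors and ask the model for a corrected script
        prompt = (f"The following χDL script has errors:\n{script}\n"
                  f"Errors: {errors}\nReturn a corrected χDL script.")
    raise RuntimeError("no valid script produced within the allowed rounds")
```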

5.6.2 Interpreted Automation

Interpreted programming languages support higher abstraction levels through flexible command structures. Similarly, GPMs can translate high-level goals into concrete steps,[Ahn et al. (2022); W. Huang et al. (2022)] acting as “interpreters” between experimental intent and lab hardware.

For example, given “titrate the solution until it turns purple”, a GPM agent Section 3.12 can break this into executable steps at runtime: add titrant incrementally, read color sensor, loop until condition met. This is “interpreted automation”—conversion happens during execution, not before.

The key advantage is real-time decision-making. After each action, the system analyzes sensor data (readings, spectra, errors) and selects the next step. This enables dynamic branching and conditional logic impossible in pre-compiled protocols.
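The titration example can be sketched as a simple runtime loop in which the GPM inspects each new sensor reading and decides whether to continue. `add_titrant` and `read_color` are hypothetical hardware-API stand-ins, and the prompt and model name are illustrative.

```python
# Sketch of interpreted automation for the titration example: after each increment the GPM
# inspects the latest sensor reading and decides the next action at runtime.
# `add_titrant` and `read_color` are hypothetical hardware-API stand-ins.
from typing import Callable
from openai import OpenAI

client = OpenAI()


def titrate_until_purple(add_titrant: Callable[[float], None],
                         read_color: Callable[[], str],
                         increment_ml: float = 0.5,
                         max_steps: int = 50) -> float:
    added = 0.0
    for _ in range(max_steps):
        add_titrant(increment_ml)
        added += increment_ml
        observation = read_color()  # e.g., "pale yellow", "purple"
        resp = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user",
                       "content": f"We are titrating until the solution turns purple. "
                                  f"The color sensor reports: '{observation}'. "
                                  f"Reply with exactly STOP or CONTINUE."}],
            temperature=0,
        )
        if "STOP" in resp.choices[0].message.content.upper():
            return added  # endpoint reached according to the model
    raise RuntimeError("endpoint not reached within the step budget")
```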

Coscientist[Boiko et al. (2023)] demonstrates interpreted automation using GPT-4 to control liquid-handling robots. The system searches the web for protocols, reads instrument documentation, writes Python code in real-time, and executes experiments on physical hardware. When errors occur, GPT-4 debugs its own code. Coscientist successfully optimized palladium cross-coupling reactions, outperforming Bayesian optimization in finding high-yield conditions.

ChemCrow[Andres M. Bran et al. (2024)] augments GPT-4 with \(18\) expert-designed tools for tasks including compound lookup, spectral analysis, and retrosynthesis. It planned and executed syntheses of N,N-diethyl-meta-toluamide (DEET) and three thiourea organocatalysts (Note 5.1), and collaborated with chemists to discover a new chromophore.

5.6.3 Hybrid Approaches

Between fully compiled and fully interpreted automation lies a hybrid approach that combines the safety and reliability of compiled protocols with the AI-driven flexibility of interpreted systems. Each experiment run follows a fixed plan for safety and reproducibility, but between runs, the plan can adapt based on the GPM’s interpretation of results. This design provides a safeguard against interpreter errors, since every generated procedure passes through formal verification before execution—catching issues like a hallucinated instruction to add 1000 mL of solvent to a 100 mL flask.
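The kind of formal check such a pipeline can run is illustrated below with a volume-balance validator that rejects Add steps exceeding a vessel’s declared capacity. The step and vessel dictionaries are illustrative data structures, not χDL’s internal representation.

```python
# Sketch of a pre-execution verification pass in a hybrid pipeline: every generated Add step
# is checked against the declared vessel capacities before anything runs on hardware.
# The dictionaries below are illustrative, not χDL's internal representation.

def verify_volumes(steps: list[dict], vessel_capacity_ml: dict[str, float]) -> list[str]:
    """Return a list of violations; an empty list means the plan passes this check."""
    errors = []
    fill = {vessel: 0.0 for vessel in vessel_capacity_ml}
    for i, step in enumerate(steps):
        if step["op"] != "Add":
            continue
        vessel = step["vessel"]
        fill[vessel] = fill.get(vessel, 0.0) + step["volume_ml"]
        if fill[vessel] > vessel_capacity_ml.get(vessel, float("inf")):
            errors.append(f"step {i}: {fill[vessel]} mL exceeds {vessel} capacity "
                          f"({vessel_capacity_ml[vessel]} mL)")
    return errors


if __name__ == "__main__":
    plan = [{"op": "Add", "vessel": "flask_100ml", "reagent": "solvent", "volume_ml": 1000.0}]
    print(verify_volumes(plan, {"flask_100ml": 100.0}))  # flags the hallucinated 1000 mL addition
```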

ORGANA [Darvish et al. (2025)] is an LLM-based robotic assistant following this hybrid paradigm. It allows human chemists to describe their experimental goal in natural language. The system can converse with the user to clarify ambiguous requests (the agent would ask “do you mean X or Y?” if the instructions are unclear). Once the goal is understood, it uses CLAIRify [Yoshikawa et al. (2023)] to convert and validate the natural-language description of a chemistry experiment into a χDL script, which can be executed on a compatible platform. In one case, ORGANA carried out a multistep electrochemistry procedure—polishing electrodes, running an experiment, and analyzing the data—involving 19 substeps that it coordinated in parallel (see Note 5.1). If an unexpected observation occurred (e.g., a solution does not change color when expected), the system can notice via image analysis and modify the plan or alert the user. In user studies, ORGANA reduced the manual labor and frustration for chemists, who could offload tedious tasks and trust the agent to handle low-level decisions.

Note 5.1: Planning and executing experiments with GPMs

ChemCrow: When prompted with the task “Find and synthesize a thiourea organocatalyst which accelerates a Diels-Alder reaction”, ChemCrow sequentially queried tools to gather information and solve the task:

“First, I need to find a thiourea organocatalyst that can accelerate the Diels-Alder reaction. I will perform a websearch to find a suitable catalyst.

WebSearch tool: Schreiner’s thiourea catalyst

Now I will obtain the SMILES.

Name2Smiles tool:

FC(F)(F)c1cc(CN(=S)Nc2cc(C(F)(F)F)cc(C(F)(F)F)c2)cc(C(F)(F)F)c1

I will plan a synthesis for Schreiner’s thiourea catalyst.

SynthesisPlanner tool: Detailed synthesis plan


I will execute the synthesis.

SynthesisExecuter tool: Successful synthesis.”

The agent proposed to synthesize the target molecule from the corresponding isothiocyanate and aniline derivatives and executed the synthesis on the RoboRXN platform. [“RoboRXN” (n.d.)] The product was confirmed using HPLC/MS: m/z 501 [M+H] calculated, found m/z 501.02.

ORGANA: ORGANA interacts with chemists using LLMs to derive experiment goals and makes decisions with chemists in the loop to control robots and lab devices.

  • Start-up: Users state their intention and start interacting with ORGANA:

    ORGANA: What is the setup of the lab environment? What hardware and reagents are present?

    User: I have a 2mM solution of the quinone……

    ORGANA: Can you provide an example of how to run the experiment?

    User: Add 2mL of quinone……

    ORGANA: What is the expected outcome of the example experiment?

    User: The pH of the solution is expected ……

    ORGANA: How many experiments are we planning to run in total?

    User: 6

    ORGANA: Ok, thank you! I will start the experiment……

  • Ambiguity and Uncertainty Mitigation: ORGANA.Reasoner evaluates whether experiment outcomes align with user expectations. If it can’t interpret the results, it alerts the user to check for environmental issues. The user provides feedback explaining the outcomes, which ORGANA.Reasoner then uses to refine its next plan:

    “I’m not sure what happened, but repeat the previous experiment just in case”

    “Nothing is wrong, carry on”

    “The pump was stuck, it’s ok again”

  • Experimental Reports: ORGANA reports the experimental logs and summaries:

    “A series of experiments was conducted to measure the potential of a quinone solution at various pH levels, specifically from pH 7 to pH 9… After performing the experiments, these are the results: The estimate for pKa1 is 8.096. The estimate for pKa2 is 12.380. The estimate for slope is -60.958.”

  • Results Comparison with Chemists: Comparison of Pourbaix diagrams and first-region slope estimates from electrochemical experiments, with ORGANA using three measurement points per pH region and chemists using four. ORGANA yields results comparable to those of the chemists: for pKa1, ORGANA obtains 8.03 and the chemists 8.02; for the estimated slope, ORGANA obtains -61.3 mV/pH while the chemists obtain -62.7 mV/pH.

5.6.4 Limitations

Current automation systems remain prototypes.

Interpreted systems require frequent human intervention despite autonomy claims. They replicate known procedures but lack mechanistic understanding. Non-deterministic GPM responses create reproducibility issues—small prompt changes yield different results, and closed-source models evolve unpredictably. Hallucinations risk incorrect planning for complex reactions. Hardware control introduces safety concerns: flexible GPMs can devise unanticipated actions, requiring robust safeguards (Section 7.2).

Compiled systems offer reliability but require extensive upfront formalization. The effort to translate protocols into formal languages often outweighs automation benefits for typical laboratory workflows.

5.6.5 Open Challenges

Self-driving laboratories orchestrated by GPMs face technical challenges requiring research advances:[Tom et al. (2024); Seifrid et al. (2022)]

  • Grounding Natural Language to Laboratory Actions. Translating ambiguous natural-language instructions (“heat gently”) into precise, safe operations requires developing validation layers that detect physically impossible or hazardous actions before hardware execution.

  • Universal Protocol Standards. No widely adopted formalization standard exists. While languages like χDL show promise, achieving interoperability across platforms requires community consensus on abstraction levels and device interfaces. model context protocol (MCP)s offer a potential path forward by enforcing consistent interfaces between GPMs, instruments, and verification layers.

  • Autonomous Error Recovery. Current systems cannot autonomously diagnose and recover from experimental failures. Developing general-purpose failure detection mechanisms and recovery strategies would enable truly autonomous operation.

  • Multimodal Integration. Chemists use diverse data types—spectra, chromatograms, TLC plates, and microscopy images. Integrating these modalities into GPM decision-making loops remains technically challenging but essential for human-level experimental reasoning.

  • Verification and Provenance. Industrial and clinical applications require complete experimental provenance: every decision logged with reasoning traces, all parameters recorded, and outcomes traceable to specific model versions (Section 7.2).

5.7 Data Analysis

The analysis of experimental data in chemistry remains a predominantly manual process.

One key challenge that makes automation particularly difficult is the extreme heterogeneity of chemical data sources. Laboratories often rely on a wide variety of instruments, some of which are decades old, rarely standardized, or unique in configuration.[Jablonka, Patiny, and Smit (2022)] These devices output data in incompatible, non-standardized, or poorly documented formats, each requiring specialized processing pipelines. Despite efforts like JCAMP-DX [McDonald and Wilks (1988)], standardization attempts remain limited and have generally failed to gain widespread use. This diversity makes rule-based or hard-coded solutions largely infeasible, as they cannot generalize across the long tail of edge cases and exceptions found in real-world workflows.

This complexity makes chemical data analysis a promising application for GPMs (Figure 5.7). These models handle diverse tasks and formats using implicit knowledge from broad training data. In other domains, LLMs have successfully processed heterogeneous tabular data and performed classical data analysis without task-specific training.[Narayan et al. (2022); Kayali et al. (2024)]

Figure 5.7: Static conventional data analysis workflow vs. dynamic GPM generated workflow. Chemical analysis can be performed with a variety of instruments and techniques, resulting in many possible output data formats. The GPM can use these diverse, raw data and process them into easy-to-understand plots, analysis, and reports. A hard-coded workflow, in contrast, is specifically made to analyze one specific data format and spectra and produces a fixed output format, e.g., the SMILES of the analyzed molecule.

In chemistry, however, only a handful of studies have so far demonstrated similar capabilities. Early evaluations showed that GPMs can support basic workflows such as the classification of X-ray photoelectron spectroscopy (XPS) signals based on peak positions and intensities.[Fu et al. (2025); Curtò et al. (2024)]

Spectroscopic data often appear as raw plots or images, making direct interpretation by vision language models (VLMs) a more natural starting point for automated analysis. A broad assessment of VLM-based spectral analysis was introduced with the MaCBench benchmark [Alampara, Schilling-Wilhelmi, et al. (2025)], which systematically evaluates how VLMs interpret experimental data in chemistry and materials science—including various types of spectra such as infrared spectroscopy (IR), nuclear magnetic resonance (NMR), and X-ray diffraction (XRD)—directly from images. The authors showed that while VLMs can correctly extract isolated features from plots, performance drops substantially in tasks requiring deeper spatial reasoning. To overcome these limitations, Kawchak (2024) explored two-step pipelines that decouple visual perception from chemical reasoning: first, the model interprets each spectrum individually (e.g., converting IR, NMR, or mass spectrometry (MS) images into textual peak descriptions), and second, an LLM analyzes these outputs to propose a molecular structure based on the molecular formula.
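A sketch of this two-step pattern, assuming an OpenAI-compatible vision model, is shown below. The prompts, model names, and output conventions are placeholders rather than the pipeline used in the cited work.

```python
# Sketch of the two-step pipeline described above: (1) a vision-capable model converts each
# spectrum image into a textual peak description, (2) a second call proposes a structure from
# the pooled descriptions and the molecular formula. Prompts and model names are placeholders.
import base64
from openai import OpenAI

client = OpenAI()


def describe_spectrum(image_path: str, technique: str) -> str:
    with open(image_path, "rb") as fh:
        b64 = base64.b64encode(fh.read()).decode()
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": [
            {"type": "text",
             "text": f"List the main peaks (position, intensity, multiplicity if visible) "
                     f"in this {technique} spectrum as plain text."},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}},
        ]}],
    )
    return resp.choices[0].message.content


def propose_structure(formula: str, peak_descriptions: dict[str, str]) -> str:
    evidence = "\n".join(f"{tech}: {desc}" for tech, desc in peak_descriptions.items())
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user",
                   "content": f"Molecular formula: {formula}\nSpectral evidence:\n{evidence}\n"
                              "Propose the most likely structure as a SMILES string."}],
    )
    return resp.choices[0].message.content
```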

More complex agentic systems extend beyond single-step analysis and attempt to orchestrate entire workflows. For example, Ghareeb et al. (2025) developed a multi-agent system for assisting biological research with hypothesis generation (see Figure 5.3) and experimental analysis. Its data analysis agent Finch autonomously processes raw or preprocessed biological data, such as ribonucleic acid (RNA) sequencing or flow cytometry, by executing code in Jupyter notebooks and producing interpretable outputs. Currently, only these two data types are supported, and expert-designed prompts are still required to ensure reliable results.

Similarly, Mandal et al. (2024) introduced AILA, which utilizes LLM agents to plan, execute, and iteratively refine full atomic force microscopy (AFM) analysis pipelines. Compared to earlier prototypes, this system emphasizes transparency and reproducibility by producing both code and reports. The system handles tasks such as image processing, defect detection, clustering, and the extraction of physical parameters.

5.7.1 Limitations

While GPMs offer promising capabilities for automating scientific data analysis, several concrete limitations remain. Recent evaluations such as SciCode [Tian et al. (2024)] have shown that even state-of-the-art (SOTA) models like GPT-4-Turbo frequently produce syntactically correct but semantically incorrect code, for instance in common data analysis steps such as reading files, applying filters, or generating plots.

These technical shortcomings are further amplified by sensitivity to prompt formulation. As demonstrated by Yan and He (2020) and Alampara, Schilling-Wilhelmi, et al. (2025), even minor changes in wording or structure can lead to drastically different results, highlighting a lack of robustness in prompt-based control.

In practice, this means that robust prompting strategies, systematic validation, and human oversight remain essential components of any current deployment.

5.7.2 Open Challenges

Looking forward, several open challenges remain unresolved.

  • True Chemical Reasoning: It remains unclear whether current LLMs can perform genuine chemical analysis rather than relying on pattern-matching or shallow feature extraction. [Alampara, Schilling-Wilhelmi, et al. (2025); Alampara, Rı́os-Garcı́a, et al. (2025)]

  • Seamless Laboratory Integration: No commercial systems yet provide robust, end-to-end interoperability with the diverse analytical instruments used in chemistry laboratories. Existing research prototypes support only limited data types and still depend heavily on expert curation. [Ghareeb et al. (2025); Mandal et al. (2024)]

  • Standardization and Real-World Validation: Achieving production-ready systems requires progress not only in model robustness but also in data and protocol standardization, hardware integration, and thorough testing under realistic laboratory conditions. [Testini, Hernández-Orallo, and Pacchiardi (2025)]

5.8 Reporting

To share insights obtained from data analysis, one often converts them into scientific publications or other forms of content, such as reports or blogs. In this step, GPMs can also take a central role. While writing assistance has been showcased in past works, it remains limited in scope and real-world impact.

5.8.1 From Data to Explanation

The lack of explainability of ML predictions generates skepticism among experimental chemists[Wellawatte and Schwaller (2025)], hindering the wider adoption of such models.[Wellawatte, Seshadri, and White (2022)] One promising approach to address this challenge is to convey explanations of model predictions in natural language. An approach proposed by Wellawatte and Schwaller (2025) involves coupling LLMs with feature importance analysis tools, such as Shapley additive explanations (SHAP) or local interpretable model-agnostic explanations (LIME). In this framework, the LLM performs three key functions: First, it translates technical feature names into more accessible language. Second, using RAG over scientific literature, it retrieves relevant excerpts that explain the physicochemical relationships between identified molecular features and target properties. Third, it synthesizes these components into coherent natural language explanations that not only identify which structural features correlate with the property of interest, but also hypothesize why these relationships exist based on established chemical principles from the literature.
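The sketch below illustrates this coupling for a single prediction: precomputed attribution values (e.g., from SHAP) are mapped to readable feature names, combined with retrieved literature excerpts, and passed to an LLM to draft the explanation. The descriptor mapping, prompt wording, and model name are illustrative assumptions, not the cited framework’s implementation.

```python
# Sketch of coupling feature attributions with an LLM, following the three steps described
# above. The attribution values would come from SHAP or LIME; here they are passed in
# directly, and the prompt wording and model name are illustrative.
from openai import OpenAI

client = OpenAI()

FRIENDLY_NAMES = {  # step 1: translate technical descriptor names (illustrative mapping)
    "MolLogP": "lipophilicity (logP)",
    "TPSA": "topological polar surface area",
    "NumHDonors": "number of hydrogen-bond donors",
}


def explain_prediction(smiles: str, target: str, attributions: dict[str, float],
                       retrieved_excerpts: list[str]) -> str:
    readable = {FRIENDLY_NAMES.get(k, k): round(v, 3) for k, v in attributions.items()}
    context = "\n".join(retrieved_excerpts)  # step 2: literature snippets from a RAG step
    prompt = (f"A model predicted the {target} of {smiles}. Feature attributions "
              f"(positive values push the prediction up): {readable}.\n"
              f"Relevant literature excerpts:\n{context}\n"
              "Explain in plain language which structural features drive the prediction and, "
              "based on the excerpts, why these relationships are chemically plausible.")
    resp = client.chat.completions.create(model="gpt-4o-mini",
                                          messages=[{"role": "user", "content": prompt}],
                                          temperature=0)
    return resp.choices[0].message.content  # step 3: synthesized natural-language explanation
```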

5.8.1.1 Writing Assistance

LLMs can assist with syntax improvement, redundancy identification,[Khalifa and Albadawy (2024)] figure and table captioning,[Hsu, Giles, and Huang (2021); Selivanov et al. (2023)] caption-figure matching,[Hsu et al. (2023)] and alt-text generation.[Singh, Wang, and Bragg (2024)] Models can be personalized for specific audiences or writing styles.[C. Li et al. (2023)]

LLMs have also been shown to help complete submission checklists. Goldberg et al. (2024) reported that \(70\%\) of Conference on Neural Information Processing Systems (NeurIPS) 2025 authors found LLM assistance useful for checklist completion, with the same fraction revising their submissions based on model feedback.

5.8.2 Limitations

LLM explanations appear credible but often lack faithfulness to the underlying reasoning.[Agarwal, Tanneru, and Lakkaraju (2024)] Models can reinforce existing biases through training data or prompting strategies.[Kobak et al. (2025)] While LLMs can process large datasets, they miss subtle artifacts and anomalies that human researchers would detect, and they struggle to distinguish correlation from causation.[Jin et al. (2023)]

5.8.3 Open Challenges

  • Provenance Tracking Systems. Developing methods to trace every claim back to specific training examples or retrieved sources. This requires architectures that maintain explicit links between generated text and source materials, enabling verification of attribution completeness.

  • Authorship Frameworks. Defining contribution taxonomies that specify when LLM use constitutes co-authorship versus tool use. Journals and institutions need consensus guidelines for disclosure, attribution, and accountability when LLMs assist in research reporting.