5  Applications

5.1 Automating the Scientific Workflow

Recent advances in general-purpose models (GPMs), particularly large language models (LLMs), have enabled initial demonstrations of fully autonomous artificial intelligence (AI) scientists [Schmidgall et al. (2025)]. We define these as LLM-powered architectures capable of executing end-to-end research workflows based solely on the final objectives, e.g., “Unexplained rise of antimicrobial resistance in Pseudomonas. Formulate hypotheses, design confirmatory in vitro assays, and suggest repurposing candidates for liver-fibrosis drugs”. Such systems navigate, partially or entirely, the components of the scientific process outlined in Figure 5.1 and detailed in the subsequent sections.

While significant applications have emerged in machine learning (ML) and programming, scientific implementations remain less explored.

5.1.1 Coding and ML Applications of AI Scientists

Notable frameworks, including Co-Scientist [Gottweis et al. (2025)] and AI-Scientist [Yamada et al. (2025)], aim to automate the entire ML research pipeline, typically employing multi-agent architectures (described in detail in Section 3.12.2.1) in which specialized agents manage distinct phases of the scientific method [Schmidgall and Moor (2025)]. Critical to these systems is self-reflection [Renze and Guven (2024)]—iterative evaluation and criticism of results within validation loops. However, comparative analyses reveal that LLM-based evaluations frequently overscore outputs relative to human assessments [Q. Huang et al. (2023); Chan et al. (2024); Starace et al. (2025)]. From an engineering perspective, alternative approaches focus specifically on iterative code optimization, enabling systems to refine their codebases [J. Zhang et al. (2025)] or generate improved agents autonomously [Hu, Lu, and Clune (2024)]. In another line of work, AlphaEvolve [Novikov et al. (2025)], an LLM operating within a genetic algorithm (GA) environment, discovered novel algorithms for matrix multiplication (a problem that had seen no improvement in over fifty years) and sorting.

5.1.3 Are These Systems Capable of Real Autonomous Research?

Although agents like Zochi [Intology.ai (2025)] achieved peer-reviewed publication in top-tier venues (Association for Computational Linguistics (ACL) 2025), their capacity for truly autonomous end-to-end research remains debatable [Son et al. (2025)]. Even when generating hypotheses that appear novel and impactful, their execution and reporting of these ideas, as demonstrated by Si, Hashimoto, and Yang (2025), yield results deemed less attractive than those produced by humans. Additionally, this autonomy raises a critical question: What should the role of AI in science be? While these systems can generate hypotheses, conduct experiments, and produce publication-ready manuscripts, their integration demands careful consideration (refer to Section 7.3 for further discussion of moral concerns around these systems). Beyond the vision of fully autonomous scientists, GPMs—primarily LLMs—are already used across many components of the scientific workflow and have proven useful for several of them. These components are shown in Figure 5.1 and discussed next.

Figure 5.1: Overview of the scientific process. The outer elements represent the typical scientific research process: from gathering information and generating hypotheses based on observations, to executing experiments and analyzing the results. The terms in the center represent data-driven “shortcuts” that “accelerate” the typical scientific method. All stages represented in the figure are discussed in detail in the following sections.

5.2 Existing GPMs for Chemical Science

The development of GPMs for chemical science represents a departure from traditional single-task approaches. Rather than fine-tuning pre-trained models for specific applications such as property prediction or molecular generation, these chemistry-aware language models are intentionally designed to perform multiple chemical tasks simultaneously. This multitask paradigm offers several advantages: shared representations across related chemical problems[Christofidellis et al. (2023)], improved data efficiency through transfer learning between tasks[Lim and Lee (2020)], and the potential for emergent capabilities that arise from joint training across diverse chemical domains[Livne et al. (2024)].

5.2.1 Multitask Learning Frameworks

DARWIN 1.5 pioneered the multitask approach by fine-tuning Llama-7B through a two-stage process [Xie et al. (2025)]. Initially trained on 332k scientific question-answer pairs to establish foundational scientific reasoning, the model subsequently underwent multitask learning to perform property prediction-related regression and classification tasks concurrently.

Building on similar principles, nach0 introduced a unified encoder-decoder transformer architecture for cross-domain chemical tasks [Livne et al. (2024)]. Pre-trained using self-supervised learning (SSL) on both natural language and chemical data, nach0 tackles diverse downstream applications including molecular structure generation, chemical property prediction, and reaction prediction. Notably, the authors found that combining chemistry-specific tasks outperformed models trained on distinct task groups, suggesting that chemical reasoning benefits from focused domain integration.

In the materials domain, Qu et al. (2023) developed a language-driven materials discovery framework that uses transformer-based embeddings (e.g., MatBERT[Wan et al. (2024)]) to represent and generate novel crystal structures. Candidates are first recalled via similarity in embedding space, then ranked using a multitask multi-gate mixture of experts (MoE) model that predicts the desired properties jointly. Their method successfully identifies novel high-performance materials (e.g., halide perovskites) and demonstrates that language representations encode latent knowledge for task-agnostic materials design.

5.2.1.1 Domain-Specific Pre-Training Strategies

A second category of GPMs emphasizes deep domain knowledge through specialized pre-training. LLaMat employed parameter-efficient fine-tuning (PEFT) specifically on crystal structure data in crystallographic information file (CIF) format, enabling the generation of thermodynamically stable structures [Mishra et al. (2024)].

ChemDFM scales this concept significantly, implementing domain pre-training on over 34 billion tokens from chemical textbooks and research articles [Zhao et al. (2024)]. Through comprehensive instruction tuning, ChemDFM familiarizes itself with chemical notation and patterns, distinguishing it from more materials-focused approaches like LLaMat through its broader chemical knowledge base.

ChemLLM further refined this approach by introducing template-based instruction tuning (ChemData) to optimize property-guided molecular generation [D. Zhang et al. (2024)].

5.2.1.2 Reasoning-Based Approaches

A recent development in chemical GPMs incorporates explicit reasoning capabilities. ether0 demonstrates this approach as a 24 billion parameter reasoning model trained on over 640k experimentally-grounded chemistry problems across diverse tasks, including retrosynthesis, solubility editing, and property prediction [Narayanan et al. (2025)]. Unlike previous models, ether0 uses reinforcement learning (RL) (see Section 3.7) to develop reasoning behaviors like verification and backtracking, demonstrating that structured problem-solving approaches can significantly improve performance on complex chemical tasks while maintaining grounding in experimental data.

These diverse approaches illustrate the evolving landscape of chemical GPMs, each balancing broad applicability with domain-specific precision. Still, most applications of GPMs use these models for a single, specific task; we review those in the following sections.

5.3 Knowledge Gathering

The rate of publishing keeps growing, and as a result, it is increasingly challenging to manually collect all relevant knowledge, potentially stymying scientific progress.[Schilling-Wilhelmi et al. (2025); Chu and Evans (2021)] Even though knowledge collection might seem like a simple task, it often involves multiple different steps, visually described in Figure 5.2. We split the discussion in this section into two parts: structured data extraction and question answering. Example queries for both sections are in Figure 5.2 B.

Figure 5.2: A. a representation of a typical agent for scientific queries. The LLM is the central piece of the system, surrounded by typical tools that improve its question-answering capabilities, together forming an agentic system. The tools represented in this figure are semantic search, citation traversal, evidence gathering, and question answering. Semantic search finds relevant documents. Evidence gathering ranks and filters chunks of text using LLMs. The citation traversal tool provides model access to citation graphs, enabling accurate referencing of each chunk and facilitating the discovery of additional sources. Finally, the question-answering tool (an LLM) collects all the information found by other tools and generates a final response to a user’s query. This part of the figure is inspired by the PaperQA2 agent.(Skarlinski et al. 2024) B. two examples of applications discussed in this section.

5.3.2 Structured Data Extraction

Semantic search can help us find relevant resources. However, for many applications it can be useful to collect data in a structured form, e.g., tables with fixed columns. Building such datasets by extracting data from the literature with LLMs is currently one of the most practical applications of LLMs in the chemical sciences [Schilling-Wilhelmi et al. (2025)]. LLMs are used in various forms for this application.

5.3.2.1 Data Extraction Using Prompting

For most applications, zero-shot prompting should be the starting point. In this context, zero-shot prompting has been used to extract data about organic reactions[Ríos-García and Jablonka (2025); Vangala et al. (2024); Patiny and Godin (2023)], synthesis procedures for metal-organic frameworks[Zheng et al. (2023)], polymers[Schilling-Wilhelmi and Jablonka (2024); Gupta et al. (2024)], solar cells[Shabih et al. (2025)], or materials data[Polak and Morgan (2024); Hira et al. (2024); Kumar, Kabra, and Cole (2025); Wu et al. (2025); S. Huang and Cole (2022)].
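
As a minimal sketch of how such zero-shot extraction is typically set up, the snippet below prompts a chat model with a target schema and a source paragraph; the client, model name, and schema fields are placeholders, and the prompts and schemas in the cited works differ and usually add validation and retries.

```python
import json
from openai import OpenAI  # any chat-completion client works; an OpenAI-compatible API is assumed here

client = OpenAI()

# Hypothetical target schema; real studies use task-specific schemas (e.g., ORD-style reaction records).
SCHEMA = {"product": "string", "yield_percent": "number", "solvent": "string", "temperature_c": "number"}

def extract_reaction_data(paragraph: str, model: str = "gpt-4o") -> dict:
    """Zero-shot extraction: no examples in the prompt, only the schema and the source text."""
    prompt = (
        "Extract the following fields from the reaction description below. "
        f"Return valid JSON matching this schema: {json.dumps(SCHEMA)}. "
        "Use null for fields that are not reported.\n\n"
        f"Reaction description:\n{paragraph}"
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    # Production pipelines validate the output (and retry) because the model may return malformed JSON.
    return json.loads(response.choices[0].message.content)
```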

5.3.2.2 Fine-tuning Based Data Extraction

If a commercial model needs to be run very often, it can be more cost-efficient to fine-tune a smaller, open-source model than to prompt a large model. In addition, models might lack specialized knowledge and might not follow certain style guides, both of which can be introduced with fine-tuning. Ai et al. (2024) fine-tuned the LLaMa-2-7B model to extract procedural chemical reaction data from United States Patent and Trademark Office (USPTO) patents and converted it to a JavaScript object notation (JSON) format compatible with the schema of the Open Reaction Database (ORD)[Kearnes et al. (2021)], achieving an overall accuracy of more than \(90\%\). In a different work, W. Zhang et al. (2024) fine-tuned GPT-3.5-Turbo to recognize and extract chemical entities from USPTO data. Fine-tuning improved the performance of the base model on the same task by more than \(15\%\). Relatedly, Dagdelen et al. (2024) proposed a human-in-the-loop data annotation process, in which humans correct the outputs of an LLM extraction instead of extracting data from scratch.
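
For illustration, one supervised fine-tuning record for such an extraction model could look as follows (JSONL, one record per line); the field names are simplified placeholders and not the actual ORD schema used in the cited work.

```python
import json

# One illustrative instruction-tuning record; field names are placeholders, not the ORD schema.
record = {
    "instruction": "Extract the reactants, product, and yield from the procedure below as JSON.",
    "input": "A mixture of benzaldehyde (1.0 g) and ... afforded the title compound (78%).",
    "output": json.dumps({
        "reactants": ["benzaldehyde"],
        "product": "title compound",
        "yield_percent": 78,
    }),
}

# Append the record to a JSONL training file consumed by the fine-tuning job.
with open("extraction_train.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(record) + "\n")
```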

5.3.2.3 Agents for Data Extraction

Agents (Section 3.12) have shown their potential in data extraction, though to a limited extent.[K. Chen et al. (2024); Kang and Kim (2024)] For example, Ansari and Moosavi (2024) introduced Eunomia, an agent that autonomously extracts structured materials science data from scientific literature without requiring fine-tuning, demonstrating performance comparable to or better than fine-tuned methods. Their agent is an LLM with access to tools such as chemical databases (e.g., the Materials Project database) and research papers from various sources.

While the authors claim this approach simplifies dataset creation for materials discovery, the evaluation is limited to a narrow set of materials science tasks (mostly focusing on metal-organic frameworks (MOFs)), indicating the need for the creation of agent evaluation tools.

5.3.2.4 Limitations

The ability to extract data from sources other than text is important, since a large amount of data is only stored in plots, tables, and figures. Despite some initial simple proofs of concept [Zheng et al. (2024)], the main bottleneck presently is the limited understanding of image data compared to text data in multimodal models.[Alampara et al. (2024)] The promise of agents lies in their ability to interact with tools (that can also interpret multimodal data). Moreover, their ability to self-reflect could automatically correct wrong results.[Du et al. (2023)]

5.3.3 Question Answering

Besides extracting information from documents in a structured format, LLMs can also be used to answer questions—such as “Has X been tried before?”—by synthesizing knowledge from a corpus of documents (and potentially automatically retrieving additional documents).

An example of a system that can do this is PaperQA. This agentic system contains tools for search, evidence gathering, and question answering, as well as for traversing citation graphs, all shown in Figure 5.2. The evidence-gathering tool collects the most relevant chunks of information via semantic search and performs LLM-based re-ranking of these chunks (i.e., the LLM reorders the chunks depending on what is needed to answer the query). Subsequently, only the top-\(n\) most relevant chunks are kept. To further ground the responses, citation traversal tools (e.g., Semantic Scholar[Kinney et al. (2023)]) are used. These leverage the citation graph as a means of discovering supplementary literature references. Ultimately, to address the user’s query, a question-answering tool is employed. It first augments the query with all the collected information before providing a definitive answer. The knowledge aggregated by these systems could be used to generate new hypotheses or challenge existing ones. Thus, in the next section, we focus on this aspect.
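
The gather-evidence-then-answer pattern described above can be sketched as follows; all callables are placeholders standing in for a vector-search backend, an LLM relevance scorer, a citation-graph client, and an answering LLM, and this is not PaperQA's actual implementation.

```python
from typing import Callable, Dict, List

Chunk = Dict[str, str]  # e.g., {"text": ..., "citation": ...}

def answer_query(
    query: str,
    search: Callable[[str, int], List[Chunk]],   # semantic search over an indexed paper corpus
    score: Callable[[str, Chunk], float],        # LLM-based relevance scoring of a chunk
    cited_by: Callable[[Chunk], List[Chunk]],    # citation-graph traversal for extra sources
    answer: Callable[[str], str],                # question-answering LLM call
    top_n: int = 5,
) -> str:
    """Schematic retrieve -> rerank -> traverse citations -> answer loop."""
    # 1. Semantic search: retrieve a broad set of candidate chunks.
    chunks = search(query, 50)
    # 2. Evidence gathering: LLM re-ranks the chunks; keep only the top-n.
    evidence = sorted(chunks, key=lambda c: score(query, c), reverse=True)[:top_n]
    # 3. Citation traversal: pull in additional sources referenced by the evidence.
    extra = [c for e in evidence for c in cited_by(e)][:top_n]
    # 4. Question answering: augment the query with all gathered, cited context.
    context = "\n\n".join(f"[{c['citation']}] {c['text']}" for c in evidence + extra)
    return answer(f"Answer using only the evidence below.\n\n{context}\n\nQuestion: {query}")
```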

5.4 Hypothesis Generation

Coming up with new hypotheses represents a cornerstone of the scientific process [Rock (2018)]. Historically, hypotheses have emerged from systematic observation of natural phenomena, exemplified by Isaac Newton’s formulation of the law of universal gravitation [Newton (1999)], which was inspired by the seemingly mundane observation of a falling apple [Kosso (2017)].

In modern research, hypothesis generation increasingly relies on data-driven tools. For example, clinical research employs frameworks such as visual interactive analytic tool for filtering and summarizing large health data sets (VIADS) to derive testable hypotheses from well-curated datasets [Jing et al. (2022)]. Similarly, advances in LLMs are now being explored for their potential to automate and enhance idea generation across scientific domains [O’Neill et al. (2025)]. However, such approaches face significant challenges due to the inherently open-ended nature of scientific discovery [Stanley, Lehman, and Soros (2017)]. Open-ended domains, as discussed in Chapter 2, risk intractability, as an unbounded combinatorial space of potential variables, interactions, and experimental parameters complicates systematic exploration [Clune (2019)]. Moreover, the quantitative evaluation of the novelty and impact of generated hypotheses remains non-trivial. As Karl Popper argued, scientific discovery defies rigid logical frameworks [Popper (1959)], and objective metrics for “greatness” of ideas are elusive [Stanley and Lehman (2015)]. These challenges underscore the complexity of automating or systematizing the creative core of scientific inquiry.

5.4.1 Initial Sparks

Recent efforts in the ML community have sought to simulate the hypothesis formulation process [Gu and Krenn (2025); Arlt et al. (2024)], primarily leveraging multi-agent systems [Jansen et al. (2025); Kumbhar et al. (2025)]. In such frameworks, agents typically retrieve prior knowledge to contextualize previous related work—grounding hypothesis generation in existing literature [Naumov et al. (2025); Ghareeb et al. (2025); Gu and Krenn (2024)]. A key challenge, however, lies in evaluating the generated hypotheses. While some studies leverage LLMs to evaluate novelty or interestingness [J. Zhang et al. (2024)], recent work has introduced critic agents—specialized components designed to monitor and iteratively correct outputs from other agents—into multi-agent frameworks (see Section 3.12.2.1). For instance, Ghafarollahi and Buehler (2024) demonstrated how integrating such critics enables systematic hypothesis refinement through continuous feedback mechanisms.

However, the reliability of purely model-based evaluation remains contentious. Si, Yang, and Hashimoto (2025) argued that relying on a GPM to evaluate hypotheses lacks robustness, advocating instead for human assessment. This approach was adopted in their work, where human evaluators validated hypotheses produced by their system and rated the LLM-produced hypotheses as more novel than those proposed by humans. Notably, Yamada et al. (2025) advanced the scope of such systems by automating the entire ML research process, from hypothesis generation to article writing. Their system’s outputs were submitted to workshops at the International Conference on Learning Representations (ICLR) 2025, with one contribution ultimately accepted. However, the advancements made by such works are currently incremental rather than unveiling new, paradigm-shifting research (see Figure 5.3).

5.4.2 Chemistry-Focused Hypotheses

In scientific fields such as chemistry and materials science, hypothesis generation requires domain intuition, mastery of specialized terminology, and the ability to reason through foundational concepts [Miret and Krishnan (2024)]. To address potential knowledge gaps in LLMs, Q. Wang et al. (2023) proposed a few-shot learning approach (see Section 3.11.1) for hypothesis generation and compared it with model fine-tuning for the same task. Their method strategically incorporates in-context examples to supplement domain knowledge while discouraging over-reliance on existing literature. For fine-tuning, they designed a loss function that penalizes possible biases when training on such examples—e.g., given the context “hierarchical tables challenge numerical reasoning”, the model would be penalized if it generated an overly generic prediction like “table analysis” instead of a task-specific one. Human evaluations of ablation studies revealed that GPT-4, augmented with a knowledge graph of prior research, outperformed fine-tuned models both in generating hypotheses with greater technical specificity and in iteratively refining them.

Complementing this work, Yang, Liu, Gao, Xie, et al. (2025) introduced the Moose-Chem framework to evaluate the novelty of LLM-generated hypotheses. To avoid data contamination, their benchmark exclusively uses papers published after the knowledge cutoff date of the evaluated model, GPT-4o. Ground-truth hypotheses were derived from articles in high-impact journals (e.g., Nature, Science) and validated by domain-specialized PhD researchers. By iteratively providing the model with context from prior studies, GPT-4o achieved coverage of over \(80\%\) of the evaluation set’s hypotheses while accessing only \(4\%\) of the retrieval corpus, demonstrating efficient synthesis of ideas presumably not present in its training corpus.

Figure 5.3: Overview of LLM-based hypothesis generation. Current methods are based on LLM sampling, in which an LLM proposes new hypotheses. The generated hypotheses are evaluated in terms of novelty and impact, either by another LLM or by a human. Then, through experimentation, the hypotheses are transformed into results, which show that current LLMs, limited by their training corpus, cannot produce groundbreaking ideas and, in the best cases, yield incremental work. This is shown metaphorically with the puzzle: the “pieces of chemical knowledge” based on the hypotheses produced by LLMs are already present in the “chemistry puzzle” and do not unveil new parts of it.

5.4.3 Are LLMs Actually Capable of Novel Hypothesis Generation?

Automatic hypothesis generation is often regarded as the Holy Grail of automating the scientific process [Coley, Eyke, and Jensen (2020)]. However, achieving this milestone remains challenging, as generating novel and impactful ideas requires questioning current scientific paradigms [Kuhn (1962)]—a skill typically refined through years of experience and currently out of reach for most ML systems.

Current progress in ML illustrates these limitations [Kon et al. (2025); Gu and Krenn (2024)]. Although some studies claim success in AI-generated ideas accepted at workshops of ML conferences via double-blind review [Zhou and Arel (2025)], these achievements are limited. First, accepted submissions often focus on coding tasks, one of the strongest domains for LLMs. Second, workshop acceptances are less competitive than main conferences, as they prioritize early-stage ideas over rigorously validated contributions. In chemistry, despite some works showing the promise of these systems [Yang, Liu, Gao, Liu, et al. (2025)], LLMs struggle to propose functional hypotheses [Si, Hashimoto, and Yang (2025)]. Their apparent success often hinges on extensive sampling and iterative refinement, rather than genuine conceptual innovation.

As Kuhn (1962) argued, generating groundbreaking ideas demands challenging prevailing paradigms—a capability missing in current ML models (they are trained to make their training data more likely rather than to question it), as shown in Figure 5.3. Thus, while accidental discoveries can arise from non-programmed events (e.g., Fleming’s identification of penicillin [Fleming (1929); Fleming (1964)]), transformative scientific advances typically originate from deliberate critique of existing knowledge [Popper (1959); Lakatos (1970)]. In addition, breakthroughs often cannot be achieved by optimizing a simple metric—as we frequently do not fully understand the problem and, hence, cannot design such a metric.[Stanley and Lehman (2015)] Despite some publications suggesting that AI scientists already exist, such claims are supported only by narrow evaluations that yield incremental progress [Novikov et al. (2025)], not paradigm-shifting insights. For AI to evolve from research assistants into autonomous scientists, it must demonstrate efficacy in addressing societally consequential challenges, such as solving complex, open-ended problems at scale (e.g., “millennium” math problems [Carlson, Jaffe, and Wiles (2006)]).

Finally, ethical considerations become critical as hypothesis generation grows more data-driven and automated. Adherence to legal and ethical standards must guide these efforts (see Section 7.2) [The Danish National Committee on Health Research Ethics (2024)].

With a hypothesis in hand, the next step is often to run an experiment to test it.

5.5 Experiment Planning

Before a human or robot can execute any experiments, a plan must be created. Planning can be formalized as the process of decomposing a high-level task into a structured sequence of actionable steps aimed at achieving a specific goal. The term planning is often confused with scheduling and RL, which are closely related but distinct concepts. Scheduling is a more specific process focused on the timing and sequence of tasks. It ensures that resources are efficiently allocated, experiments are conducted in an optimal order, and constraints (such as lab availability, time, and equipment) are respected.[Kambhampati et al. (2023)] RL is about adapting and improving plans over time based on ongoing results.[P. Chen et al. (2022)]

5.5.1 Conventional Planning

Early experimental planning in chemistry relied on human intuition and domain expertise. One example of this is retrosynthesis. Beginning in the 1960s, systems like logic and heuristics applied to synthetic analysis (LHASA) [Corey, Cramer III, and Howe (1972)] automated retrosynthesis using hand-coded rules and heuristics[Warr (2014)]. Later tools, such as Chematica[Grzybowski et al. (2018)], expanded these efforts by integrating larger template libraries and optimization strategies. As reaction data grew in volume and complexity, manual rule encoding became unsustainable. Platforms like ASKCOS[Tu et al. (2025)] integrated graph neural networks (GNNs) and neural classifiers to predict reactivity and suggest conditions, enabling actionable synthetic routes.

All applications, however, face the problem that planning is difficult because search spaces are combinatorially large and evaluating potential paths, in principle, requires a model that can perfectly predict the outcomes of different actions. Conventional approaches often rely on search algorithms such as breadth-first search (BFS), depth-first search (DFS), or Monte Carlo tree search (MCTS) [Segler, Preuß, and Waller (2017)]. These, however, are often still not efficient enough to tackle long-horizon planning for complex problems.

5.5.2 LLMs to Decompose Problems into Plans

GPMs, in particular LLMs, can potentially assist in planning with two modes of thinking. Deliberate (system-2-like) thinking can be used to score potential options or to decompose problems into plans. Intuitive (system-1-like) thinking can be used to efficiently prune search spaces. These two modes align with the psychological framework of system-1 and system-2 thinking.[Kahneman (2011)] In system-1 mode, LLMs support rapid decision-making by leveraging heuristics and pattern recognition to quickly narrow down options. In contrast, system-2 thinking represents a slower, more analytical process, in which LLMs solve complex tasks—such as logical reasoning and planning—by explicitly generating step-by-step reasoning.[Ji et al. (2025)]

Decomposing a goal into actionable milestones relies on this deliberate, system-2-style reasoning, enabling the model to evaluate alternatives and structure plans effectively. A variety of strategies have been proposed to improve the reasoning capabilities of LLMs during inference. Methods such as chain-of-thought (CoT) and least-to-most prompting guide models to decompose problems into interpretable steps, improving transparency and interpretability. However, their effectiveness in planning is limited by error accumulation and linear thinking patterns.[Stechly, Valmeekam, and Kambhampati (2024)] To address these limitations, recent test-time strategies such as repeated sampling and tree search have been proposed to enhance planning capabilities in LLMs. Repeated sampling allows the model to generate multiple candidate reasoning paths, encouraging diversity in thought and increasing the chances of discovering effective subgoal decompositions.[E. Wang et al. (2024)] Meanwhile, tree search methods like tree-of-thought (ToT) and reasoning via planning (RAP) treat reasoning as a structured search, using algorithms like MCTS to explore and evaluate multiple solution paths, facilitating more global and strategic decision-making.[Hao et al. (2023)]
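
As a minimal sketch of the repeated-sampling idea, the function below draws several candidate plans and keeps the highest-scoring one; the proposal and scoring callables are placeholders (in practice, sampling temperature provides the diversity, and tree-search methods replace this simple arg-max selection).

```python
from typing import Callable, List, Tuple

def best_of_n_plans(
    goal: str,
    propose: Callable[[str], List[str]],   # one LLM call returning a step-by-step plan for the goal
    score: Callable[[List[str]], float],   # heuristic or LLM-based evaluation of a candidate plan
    n: int = 8,
) -> Tuple[List[str], float]:
    """Sample n candidate goal decompositions and return the highest-scoring plan with its score."""
    candidates = [propose(goal) for _ in range(n)]          # diversity comes from stochastic sampling
    scored = [(plan, score(plan)) for plan in candidates]
    return max(scored, key=lambda item: item[1])
```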

Beyond purely linguistic reasoning, LLMs have also been used to interpret natural-language queries and to translate them into structured planning steps, as demonstrated by systems like LLM+P[B. Liu et al. (2023)] and LLM-DP[Dagan, Keller, and Lascarides (2023)], which integrated LLMs with classical planners to convert planning problems into planning domain definition language (PDDL). LLMs have also been applied to generate structured procedures from limited observations. For example, in quantum physics, a model was trained to infer reusable experimental templates from measurement data, producing Python code that generalized across system sizes. [Arlt et al. (2024)] This demonstrates how LLMs can support scientific planning by synthesizing high-level protocols from low-level evidence, moving beyond symbolic reasoning to executable plan generation.

5.5.3 Pruning of Search Spaces

Pruning refers to the process of eliminating unlikely or suboptimal options during the search to reduce the computational burden. Because the number of potential pathways can grow exponentially, exhaustive search may be computationally intensive. Classical planners employ heuristics, value functions, or logical filters to perform pruning[Bonet and Geffner (2012)]. LLMs can emulate pruning through learned heuristics, intuitive judgment, or context-driven evaluation, [Gao et al. (2025)] reflecting system-1 thinking. Figure 5.4 illustrates how LLMs can support experimental planning by selectively pruning options. Rule-based heuristics derived from domain knowledge can automatically discard routes involving unfavorable motifs, such as chemically strained rings or complex aromatic scaffolds. Meanwhile, LLMs can emulate an expert chemist’s intuition by discarding synthetic routes that appear unnecessarily long, inefficient, or mechanistically implausible.
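
A small sketch of such rule-based pruning is shown below using RDKit; the strain heuristic (rejecting routes whose intermediates contain three- or four-membered rings) and the step-count cutoff are deliberately simple stand-ins for the richer domain rules and learned intuition a real planner would apply.

```python
from typing import List
from rdkit import Chem

def is_strained(smiles: str) -> bool:
    """Flag molecules containing 3- or 4-membered rings as a crude strain heuristic."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:          # unparsable structures are treated as unusable
        return True
    ring_sizes = [len(ring) for ring in mol.GetRingInfo().AtomRings()]
    return any(size in (3, 4) for size in ring_sizes)

def prune_routes(routes: List[List[str]], max_steps: int = 8) -> List[List[str]]:
    """Keep candidate routes (lists of intermediate SMILES) that are short and avoid strained intermediates."""
    return [
        route for route in routes
        if len(route) <= max_steps and not any(is_strained(smiles) for smiles in route)
    ]
```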

To further enhance planning efficacy, LLMs can be augmented with external tools that estimate the feasibility or performance of candidate plans, enabling targeted pruning of the search space before costly execution. In ChemCrow, the LLM collaborated with specialized chemical tools that encode knowledge about molecular and reaction properties. While ChemCrow does not explicitly generate and prune a large pool of candidate plans, these tools serve as real-time evaluators that help the model avoid unfeasible or inefficient directions during synthesis or reaction planning.

In addition to external tools, LLMs can also engage in self-correction, a reflective strategy that identifies and prunes flawed reasoning steps within their own outputs. This introspective pruning supports more robust and coherent planning by discarding faulty intermediate steps before they affect final decisions. As such, self-correction offers a lightweight yet effective mechanism for narrowing the solution space in complex reasoning tasks. At the highest level of oversight, human-in-the-loop frameworks introduce expert feedback to guide pruning decisions. The ORGANA system[Darvish et al. (2025)] integrated chemist feedback into the planning process, helping define goals, resolve ambiguities, and eliminate invalid directions.

Figure 5.4: GPM-guided retrosynthesis route planning and pruning. GPMs can systematically evaluate and prune retrosynthetic routes using multiple reasoning capabilities to discriminate between viable and problematic approaches. The partially overlapping arrows at the start of each route indicate multiple steps. Route A: This route was pruned by heuristic reasoning due to the unfavorable aromatic core construction. Route B: This route was selected as it successfully passes all GPM planning checks, demonstrating optimal synthetic feasibility. Route C: This route was pruned by external tools due to the poor regioselectivity of the oxidation step. Route D: This route was pruned based on learned intuition, as it represents an inefficient multistep pathway; the route could simply start with phenol instead of synthesizing it.

5.5.4 Evaluation

While pruning accelerates planning, its effectiveness depends on reliable evaluation—the ability to judge whether a candidate plan is valid or promising. However, evaluating planning quality is particularly challenging in scientific fields such as chemistry and biology. Many alternative plans may achieve the same goal, so evaluation is inherently ambiguous in the absence of a comprehensive world model. In open-ended domains, evaluation is often conducted manually. For example, ChemCrow [Bran et al. (2024)] relied on expert review to assess the correctness and plausibility of generated outputs. More dynamic evaluations can be performed in simulated or real embodied environments [Song et al. (2023); Choi et al. (2024)], offering interactive feedback on feasibility. In parallel, automatic evaluation methods are emerging. For example, BioPlanner[O’Donoghue et al. (2023)] used pseudocode-based evaluation, comparing LLM-generated protocols to expert-written pseudocode representations to assess plausibility and correctness without requiring manual review or physical execution.

5.6 Experiment Execution

Once an experimental plan is available, whether from a human scientist’s idea or a sophisticated AI model, the next step is to execute it. Regardless of its source, the plan must be translated into concrete, low-level actions for execution. One of the main challenges of lab automation is to convert the high-level and abstract experimental plan into real-world operations carried out by the experimental hardware (liquid-handling systems, robotic arms, instruments, etc.).

It is worth noting that, despite their methodological differences, executing experiments in silico (running simulations or code) and in vitro are not fundamentally different—both follow an essentially identical workflow: Plan \(\rightarrow\) Instructions \(\rightarrow\) Execution \(\rightarrow\) Analysis. In a computer simulation, a researcher writes a program (plan), which is then compiled or interpreted into machine code (instructions) for the central processing unit (CPU), executed to produce data, and finally the outputs are analyzed. In an automated laboratory, the scientist specifies a protocol (plan), which must be translated into instrument commands (instructions), executed on a robotic platform, followed by the analysis of sensor data or assay results. Both scenarios require careful translation of abstract steps into concrete actions, as well as further decision-making based on the acquired results.

The execution of in silico experiments can be reduced to two essential steps: preparing input files and running the computational code; GPMs can be used in both steps.[Z. Liu, Chai, and Li (2025); Mendible-Barreto et al. (2025); Zou et al. (2025); Campbell et al. (2025)] Jacobs and Pollice (2025) found that using a combination of fine-tuning, CoT and RAG (see Section 3.11) can improve the performance of LLMs in generating executable input files for the quantum chemistry software ORCA[Neese (2022)], while Gadde et al. (2025) created AutosolvateWeb, an LLM-based platform that assists users in preparing input files for quantum mechanics/molecular mechanics (QM/MM) simulations of explicitly solvated molecules and running them on a remote computer. Examples of GPM-based autonomous agents (see Section 3.12) capable of performing the entire computational workflow (i.e., preparing inputs, executing the code, and analyzing the results) are MDCrow [Campbell et al. (2025)] (for molecular dynamics) and El Agente Q [Zou et al. (2025)] (for quantum chemistry).
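
As a hedged illustration of the input-preparation step, the sketch below asks a chat model to draft an ORCA input and then runs it; the prompt, the model name, and the assumption that the orca binary is on PATH are placeholders, and generated inputs should be validated rather than blindly executed, as the cited works emphasize.

```python
import subprocess
from pathlib import Path
from openai import OpenAI  # any chat LLM client works; an OpenAI-style API is assumed

client = OpenAI()

def generate_orca_input(task: str, model: str = "gpt-4o") -> str:
    """Ask the LLM to draft an ORCA input file from a natural-language task description."""
    prompt = (
        "Write a complete ORCA input file for the following calculation. "
        "Return only the input file contents, with no commentary.\n\n" + task
    )
    response = client.chat.completions.create(model=model, messages=[{"role": "user", "content": prompt}])
    return response.choices[0].message.content

inp = generate_orca_input("B3LYP/def2-SVP geometry optimization of a water molecule.")
Path("calc.inp").write_text(inp)
# The generated input should be checked (e.g., by a parser or a second validation prompt)
# before execution; here we simply run it, assuming the orca executable is available.
subprocess.run(["orca", "calc.inp"], check=True)
```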

GPMs can also assist in automating in vitro experiments. We can draw parallels from programming language paradigms—compiled vs. interpreted (see Figure 5.5 A)—to better understand how GPMs can be useful in different approaches to experiment automation. In compiled languages (like C++ or Fortran), the entire code is converted ahead of time by another program called the “compiler” into binary machine code, which is directly executable by the hardware. In interpreted languages (like Python or JavaScript), a program called the “interpreter” reads the instructions line-by-line during runtime, translating and executing them on the fly. Compiled languages offer high performance and early error detection, making them ideal for performance-critical systems, but they require a separate compilation step and are less flexible during development. Interpreted languages are easier to use, debug, and modify on the fly, which makes them great for rapid development and scripting, but they generally run slower and catch errors only at runtime. Similarly, we can broadly categorize approaches to experiment automation into two groups: “compiled automation” and “interpreted automation” (see Figure 5.5 B). In the compiled approach, the entire protocol is translated—either by a human or a GPM—into low-level instructions before execution, while in interpreted automation, the GPM plays a central role, acting as the “interpreter” and executing the protocol step by step. As we show below, it can be instructive to use this perspective when discussing approaches to automating experiment execution with GPMs.

Figure 5.5: Programming languages vs. lab automation. A) programming paradigms: In compiled languages, the entire source code is translated ahead of time to machine code by the compiler. This stand-alone code is then given to the operating system (OS), which is responsible for scheduling and distributing tasks to the hardware. In interpreted languages, the interpreter reads and translates each line of the source code to machine code and hands it to the OS for execution. B) automation paradigms: In the compiled approach, a GPM formalizes the protocol; a compiler, such as the chempiler(Steiner et al. 2019), translates the formalized protocol into hardware-specific low-level steps, which are then executed by the controller—a central hub tasked with scheduling and distributing commands to chemical hardware. In the interpreted approach, a GPM, acting as the interpreter, first breaks down the protocol into specific steps, then sends them (via an application programming interface (API)) for execution one by one. The strength of interpreted systems is dynamic feedback: after the execution of each step, the GPM receives a signal (e.g., data, errors), which can influence its behavior in subsequent steps.

5.6.1 Compiled Automation

In the case of “compiled automation”, the experiment protocol needs to be formalized in a high-level or domain-specific language (DSL) that describes exactly what operations to perform in what order. A chemical compiler (or “chempiler” [Steiner et al. (2019)]) then converts this high-level protocol into low-level code for the specific lab hardware, which is then executed by robotic instruments, orchestrated by a controller (refer to the caption of Figure 5.5 B).

5.6.1.1 Protocol Languages

While Python-based scripts are frequently used as the de facto protocol language due to Python’s accessibility and flexibility,[Wierenga et al. (2023); Vriza et al. (2023); C. Wang et al. (2025)] specialized languages (DSLs) have also been developed to provide more structured and semantically rich representations of experimental procedures.[Z. Wang et al. (2022); Ananthanarayanan and Thies (2010); Strateos (2023); Park et al. (2023)] One of the prominent examples of such languages is the chemical description language (\chiDL)[Group (2023)], developed as part of the Chemputer architecture [Steiner et al. (2019); Mehr et al. (2020); Hammer et al. (2021)]. \chiDL uses a JSON-like format, and the experimental protocol is described by defining Reagents, Vessels, etc., and by using abstract chemical commands such as Add, Stir, and Filter. In the next step, the Chempiler software takes this \chiDL script together with a graph description of the physical connectivity and composition of the automated platform and translates it into chemical assembly language (ChASM), which is specific to the platform (akin to machine code). In practice, \chiDL has been used to automate multi-step organic syntheses with yields comparable to manual experiments.[Mehr et al. (2020)]
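
To make the structure of such protocols concrete, the snippet below renders a short acetylation procedure as a Python dictionary with reagents, vessels, and abstract steps; this is a schematic illustration of the idea rather than exact \chiDL syntax, and a chempiler would translate a real protocol of this kind into hardware-specific instructions.

```python
# Schematic compiled-automation protocol, illustrating reagents, vessels, and abstract steps.
# This mirrors the spirit of a \chiDL protocol but is not its exact syntax.
protocol = {
    "reagents": [
        {"name": "aniline"},
        {"name": "acetic_anhydride"},
    ],
    "vessels": [
        {"name": "reactor", "type": "jacketed_flask", "volume_ml": 250},
    ],
    "steps": [
        {"op": "Add", "vessel": "reactor", "reagent": "aniline", "amount_ml": 10},
        {"op": "Add", "vessel": "reactor", "reagent": "acetic_anhydride", "amount_ml": 12},
        {"op": "Stir", "vessel": "reactor", "time_min": 30, "temp_c": 25},
        {"op": "Filter", "vessel": "reactor"},
    ],
}
```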

Developing experimental protocols in a formal language is a non-trivial task, often requiring specialized coding expertise. Within the compiled approach, the role of the GPM is to translate natural-language protocols into their formalized, machine-readable counterparts.[Sardiña, García-González, and Luaces (2024); Jiang et al. (2024); Conrad et al. (2025); Inagaki et al. (2023)] Vaucher et al. (2020) used an encoder-decoder transformer model to convert English experimental procedures to structured sequences of pre-defined synthesis actions (e.g., MakeSolution, SetTemperature, Extract). They pre-trained the model on \(2\)M sentence-action pairs extracted by a rule-based natural-language processing (NLP) algorithm and then fine-tuned it on manually annotated samples to improve accuracy. The model achieved exact sentence-pair matching in \(61\%\) of the test samples and had more than \(75\%\) overlap in \(82\%\) of them. Although this approach accelerates automated protocol extraction from chemical literature, the output format is not directly suitable for execution.

Pagel, Jirásek, and Cronin (2024) introduced a multi-agent workflow (based on GPT-4) that can address this issue and convert unstructured chemistry papers into executable code. The first agent extracts all synthesis-relevant text, including supporting information; a procedure agent then sanitizes the data and tries to fill the gaps from chemical databases (using RAG); another agent translates procedures into \chiDL and simulates them on virtual hardware; finally, a critique agent cross-checks the translation and fixes errors.

The example above shows one of the strengths of the compiled approach: it allows for pre-validation. The protocol can be simulated or checked for errors before running on the actual hardware, ensuring safety. Another example of an LLM-based validator for chemistry protocols is CLAIRify.[Yoshikawa et al. (2023)] Leveraging an iterative prompting strategy, it uses GPT-3.5 to first translate the natural-language protocol into a \chiDL script, then automatically verifies its syntax and structure, identifies any errors, appends those errors to the prompt, and prompts the LLM again—iterating this process until a valid \chiDL script is produced.
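
The underlying generate-verify-retry pattern can be sketched as follows; the translation and verification callables are placeholders, and CLAIRify's actual prompts, verifier, and constraint handling are more involved.

```python
from typing import Callable, List, Optional

def translate_with_retries(
    protocol_text: str,
    translate: Callable[[str], str],      # LLM call: natural-language protocol -> formal script
    verify: Callable[[str], List[str]],   # static checker returning a list of error messages
    max_iters: int = 5,
) -> Optional[str]:
    """Re-prompt the LLM with verifier errors until a valid script is produced (or give up)."""
    prompt = protocol_text
    for _ in range(max_iters):
        script = translate(prompt)
        errors = verify(script)
        if not errors:
            return script                 # valid script: stop iterating
        # Append the verifier errors so the next attempt can correct them.
        prompt = protocol_text + "\n\nPrevious attempt had these errors:\n" + "\n".join(errors)
    return None                           # no valid script within the iteration budget
```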

Similar to how compiled software can be recompiled for different platforms, compiled automation is hardware-agnostic: by using appropriate compilation, a well-defined protocol can—at least in principle—be run on different robotic systems as long as they have the required capabilities.[Rauschen et al. (2024); Strieth-Kalthoff et al. (2024); Wilbraham, Mehr, and Cronin (2021)] In practice, however, inconsistencies in hardware interfaces and software standards across the lab automation community make cross-platform execution challenging.

The main limitations of compiled approaches are the flip side of their strengths: low flexibility and adaptability. Any logic or decision-making must either be explicitly encoded within the protocol—necessitating meticulous scripting—or delegated to an external control layer.[M. Mehr, Caramelli, and Cronin (2023); Leonov et al. (2024)] If something unexpected occurs (a pump clogging, a reaction taking longer than expected), the pre-compiled protocol cannot easily adjust in real-time, and human intervention or a complete recompile might be needed.

5.6.2 Interpreted Automation

Interpreted programming languages support higher levels of abstraction, enabling the use of more general and flexible command structures. Similarly, since GPMs can translate high-level goals into concrete steps[Ahn et al. (2022); W. Huang et al. (2022)], they can act as an “interpreter” between the experimental intent and lab hardware. For instance, given an instruction “titrate the solution until it turns purple”, a GPM agent (see Section 3.12) can break it down into smaller steps and convert each step to executable code, allowing it to perform incremental additions of titrant and read a color sensor, looping until the condition is met. This conversion of concrete steps to code happens at runtime; it is not pre-compiled. We refer to such systems as “interpreted automation” systems. In contrast to the deterministic, preplanned nature of compiled systems, interpreted architectures introduce real-time decision-making. As each action completes, the system collects sensor data (instrument readings, spectra, error messages, etc.) which the agent analyzes and decides on the next action. This allows for dynamic branching and conditional logic during the experiment execution.
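
The titration example can be written as a small interpreted control loop; the hardware calls `add_titrant` and `read_color_sensor` below are hypothetical, and in an interpreted-automation system the GPM would emit code of roughly this shape at runtime and adapt it to the readings it receives.

```python
import time

def titrate_until_purple(add_titrant, read_color_sensor, step_ml: float = 0.5, max_ml: float = 50.0) -> float:
    """Incrementally add titrant and poll a color sensor until the endpoint is reached.

    `add_titrant(volume_ml)` and `read_color_sensor() -> str` are hypothetical hardware calls
    standing in for a real platform API.
    """
    added = 0.0
    while added < max_ml:
        add_titrant(step_ml)
        added += step_ml
        time.sleep(2)                          # allow the solution to mix before reading the sensor
        if read_color_sensor() == "purple":
            return added                       # endpoint reached: report the total volume added
    raise RuntimeError("Endpoint not reached within the allowed titrant volume")
```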

Coscientist [Boiko et al. (2023)] is an LLM-based chemistry assistant built around GPT-4 that can autonomously design and execute experiments. It can take high-level goals and call tools to write code in real-time in order to control an Opentrons OT-2 liquid-handling robot. The architecture included a web-search module, a documentation module (to read instrument manuals), a Python execution module (to run generated code in a sandbox), and an experiment execution module that sends code to actual lab equipment. If an error occurred, the system would get feedback and GPT-4 would debug its own code. Coscientist successfully planned and executed multistep syntheses with minimal human intervention. For example, it efficiently optimized a palladium cross-coupling reaction with minimal human input, outperforming a standard Bayesian optimizer baseline in finding high-yield conditions.

Another example is ChemCrow [Bran et al. (2024)], a GPT-4-based agent augmented with \(18\) expert-designed tools for tasks like compound lookup, spectral analysis, and retrosynthesis. ChemCrow can perform tasks across synthesis planning, drug discovery, and materials design by invoking external software for things like retrosynthesis, property prediction, database queries, etc. It planned and executed the syntheses of an insect repellent, N,N-diethyl-meta-toluamide (DEET), and three different organocatalysts and even guided the discovery of a new chromophore dye.

The interpreted paradigm is highly generalizable; in principle, the same LLM agent controlling a chemistry experiment could be re-purposed to a biology or materials experiment with minimal reprogramming because it operates at the level of intent and semantic understanding. However, fully autonomous labs featuring interpreted automation are still experimental themselves—ensuring their reliability and accuracy remains an open challenge.

Despite being labeled as “autonomous,” both systems mentioned above often need prompting nudges and human correction. In addition, these models can replicate known procedures and use databases, but they lack an understanding of mechanisms or underlying principles. Another issue is full reproducibility and long-term experiment tracking. Since the GPM’s response might not be deterministic, small changes in prompts can yield different results, and closed-source models like GPT-4 can change over time. Hallucinations remain a risk, especially in planning complex or sensitive reactions. In addition, allowing an agent to control hardware brings safety considerations; the flexibility of GPMs means that they can devise unanticipated actions. Designing safety nets for these systems is an active area of research (see Section 7.2).

5.6.3 Hybrid Approaches

Between the two extremes of fully compiled vs. fully interpreted automation lies a hybrid approach that seeks to combine the best of both paradigms: the safety and reliability of compiled protocols and the AI-driven flexibility of interpreted systems.

The key difference from purely interpreted systems is that during each experiment run, the plan is fixed, ensuring safety and reproducibility, but between runs, the plan can dynamically change based on the GPM’s interpretation of results. Once the initial plan (ideally devised by the same GPM in a previous step) is provided to a hybrid system, instead of reducing it to smaller steps and directly sending the instructions to a laboratory one at a time, the protocol is first formalized—i.e., it is translated to a formal machine-readable format such as \chiDL. Once validated, the formalized protocol is compiled and executed. After the completion of execution, the GPM receives the results and decides what experiment to perform next. This cycle repeats, creating an autonomous optimization or discovery loop.

This hybrid strategy is attractive because it provides a safety net against mistakes made by the GPM interpreter; any generated procedure must pass through a formalization and verification stage before real execution, and therefore, erroneous or hallucinated steps can be caught. For example, if the interpreter hallucinated adding 1000 mL of a solvent but the hardware has only 100 mL capacity, it can be flagged as an error.
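
A schematic of this hybrid loop, with the validation gate sitting between the GPM and the hardware, is sketched below; all callables are placeholders, and real systems such as ORGANA add user clarification, perception, and scheduling on top of this skeleton.

```python
from typing import Any, Callable, List

def hybrid_loop(
    goal: str,
    plan: Callable[[str, List[Any]], str],    # GPM: goal + past results -> natural-language protocol
    formalize: Callable[[str], str],          # GPM: protocol -> formal, machine-readable script
    validate: Callable[[str], List[str]],     # static checks: syntax, volumes, hardware capabilities
    execute: Callable[[str], Any],            # compile and run the fixed script on the platform
    done: Callable[[List[Any]], bool],        # stopping criterion (e.g., target property reached)
    max_rounds: int = 10,
) -> List[Any]:
    """Within each round the plan is frozen; between rounds the GPM replans from the results."""
    results: List[Any] = []
    for _ in range(max_rounds):
        script = formalize(plan(goal, results))
        errors = validate(script)               # safety net: hallucinated or infeasible steps are caught here
        if errors:
            results.append({"errors": errors})  # feed validation failures back into the next planning round
            continue
        results.append(execute(script))
        if done(results):
            break
    return results
```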

ORGANA [Darvish et al. (2025)] is an LLM-based robotic assistant following this hybrid paradigm. It allows human chemists to describe their experimental goal in natural language. The system can converse with the user to clarify ambiguous requests (the agent would ask “do you mean X or Y?” if the instructions are unclear). Once the goal is understood, it uses CLAIRify [Yoshikawa et al. (2023)] to convert and validate the natural-language description of a chemistry experiment into a \chiDL script, which can be executed on a compatible platform. In one case, ORGANA carried out a multistep electrochemistry procedure—polishing electrodes, running an experiment, and analyzing the data—involving 19 substeps that it coordinated in parallel. If an unexpected observation occurred (e.g., a solution does not change color when expected), the system can notice via image analysis and modify the plan or alert the user. In user studies, ORGANA significantly reduced the manual labor and frustration for chemists, who could offload tedious tasks and trust the agent to handle low-level decisions.

5.6.4 Comparison and Outlook

While compiled paradigms continue to provide the backbone for reliable automation, interpreted paradigms will drive exploratory research, where adaptability is key. Hybrid systems are likely to be the bridge that brings AI into mainstream lab operations, ensuring that flexibility comes with accountability. A brief comparison of the three mentioned approaches is given in Table 5.1.

Table 5.1: Comparison of the Compiled, Interpreted, and Hybrid Automation Paradigms. Each approach has its strengths and weaknesses. Compiled systems favor reliability, interpreted systems allow for more flexibility, while hybrid systems try to strike a balance.
Feature               Compiled   Interpreted   Hybrid
Flexibility           Low        High          Medium
Adaptivity            None       Real-time     Iterative
Reproducibility       High       Medium        High
Safety                High       Low           Medium
Setup Overhead        Medium     High          High
Industrial Readiness  Low        Low           Low

While we are essentially witnessing the rise of self-driving laboratories, autonomous experimentation systems present a range of challenges.[Tom et al. (2024); Seifrid et al. (2022)] First, translating high-level natural-language goals into precise laboratory actions remains difficult, as GPMs can misinterpret ambiguous instructions, leading to invalid or unsafe procedures. This problem is compounded by the lack of universally adopted standards for protocol formalization; while languages like \chiDL show promise, inconsistencies in abstraction, device compatibility, and community uptake limit interoperability. Real-time execution adds further complexity, as systems must detect and respond to failures or unexpected behaviors; however, general-purpose validation mechanisms and recovery strategies remain underdeveloped. Hardware integration is another bottleneck; current commercial robotic platforms are prohibitively expensive, lab environments often rely on a patchwork of instruments with proprietary interfaces, and building robust, unified control layers demands considerable engineering overhead. Another challenge is multi-modality in chemistry; chemists use a wide variety of data (e.g., spectra, thin-layer chromatography (TLC) plates, scanning electron microscopy (SEM) images). Without integrating these forms of output, models will be limited in their decision-making. Finally, ensuring reproducibility and regulatory compliance requires that every step be logged, validated, and traceable at the level required for clinical or industrial adoption (see Section 7.2). These challenges must be addressed in tandem to move from experimental demonstrations toward reliable, scalable, and trustworthy autonomous laboratories.

5.7 Data Analysis

The analysis of spectroscopic and experimental data in chemistry remains a predominantly manual process. Even seemingly straightforward steps, such as plotting or summarizing results, demand repeated manual intervention.

One key challenge that makes automation particularly difficult is the extreme heterogeneity of chemical data sources. Laboratories often rely on a wide variety of instruments, some of which are decades old, rarely standardized, or unique in configuration.[Jablonka, Patiny, and Smit (2022)] These devices output data in incompatible, non-standardized, or poorly documented formats, each requiring specialized processing pipelines. Despite efforts like JCAMP-DX [McDonald and Wilks (1988)], standardization attempts remain scarce and have generally failed to gain widespread use. This diversity makes rule-based or hard-coded solutions largely infeasible, as they cannot generalize across the long tail of edge cases and exceptions found in real-world workflows.

However, this exact complexity makes data analysis in chemistry a promising candidate for GPMs. They are designed to operate flexibly across diverse tasks and formats, relying on implicit knowledge captured from broad training data. In other domains, Narayan et al. (2022) showed that models like GPT-3 DaVinci can already perform classical data processing tasks such as cleaning, transformation, and error detection through prompting alone. Kayali et al. (2024) introduced Chorus, which shows that LLMs can analyze heterogeneous tabular data without task-specific training. Chorus demonstrates that, by converting tables into a standardized text format and using zero-shot prompting (i.e., prompts with no examples), LLMs can flexibly analyze tables even when they differ in structure, column names, or data types.

Figure 5.6: Static conventional data analysis workflow vs. dynamic GPM-generated workflow. The chemical analysis can be done with a variety of possible instruments and techniques, resulting in a large number of possible output data formats. The GPM can take these diverse raw data and process them into easy-to-understand plots, analyses, and reports. A hard-coded workflow, in contrast, is built to analyze one specific data format and type of spectrum and produces a fixed output format, e.g., the simplified molecular input line entry system (SMILES) string of the analyzed molecule.

5.7.1 Prompting

Initial evaluations demonstrated that GPMs can support basic data analysis workflows.[Fu et al. (2025)] For example, in chemistry, this enabled the classification of X-ray photoelectron spectroscopy (XPS) signals [Curtò et al. (2024)] based on peak positions, intensities, or characteristic spectral patterns.

Spectroscopic data are not always available in structured textual form. In many practical cases, they appear as raw plots or images, making direct interpretation by vision language models (VLMs) a more natural starting point for automated analysis. A broad assessment of VLM-based spectral analysis was introduced with the MaCBench benchmark [Alampara et al. (2024)], which systematically evaluates how VLMs interpret experimental data in chemistry and materials science—including various types of spectra such as infrared spectroscopy (IR), nuclear magnetic resonance (NMR), and X-ray diffraction (XRD)—directly from images. They showed that while VLMs can correctly extract isolated features from plots, the performance substantially drops in tasks requiring deeper spatial reasoning. To overcome these limitations, Kawchak (2024) explored two-step pipelines that decouple visual perception from chemical reasoning. First, the model interprets each spectrum individually (e.g., converting IR, NMR, or mass spectrometry (MS) images into textual peak descriptions), and second, an LLM analyzes these outputs to propose a molecular structure based on the molecular formula.
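
Such a two-step pipeline can be sketched as follows; the vision and reasoning callables are placeholders for a VLM and an LLM, and real pipelines add parsing of the peak descriptions and consistency checks against the molecular formula.

```python
from typing import Callable, Dict

def elucidate_structure(
    spectra_images: Dict[str, bytes],                 # e.g., {"IR": ..., "1H NMR": ..., "MS": ...}
    molecular_formula: str,
    describe_spectrum: Callable[[str, bytes], str],   # VLM call: spectrum image -> textual peak description
    reason: Callable[[str], str],                     # LLM call: combined text -> proposed structure
) -> str:
    """Step 1: convert each spectrum image to text. Step 2: reason over all descriptions jointly."""
    descriptions = [
        f"{technique} spectrum: {describe_spectrum(technique, image)}"
        for technique, image in spectra_images.items()
    ]
    prompt = (
        f"Molecular formula: {molecular_formula}\n"
        + "\n".join(descriptions)
        + "\nPropose the most likely structure as a SMILES string."
    )
    return reason(prompt)
```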

5.7.2 Agentic Systems

Beyond zero-shot prompting of GPMs, one can develop agentic systems that combine multiple analysis steps end-to-end. In this regard, Ghareeb et al. (2025) developed Robin—a multi-agent system for assisting biological research with hypothesis generation (see Figure 5.3) and experimental analysis. The data analysis agent Finch performs autonomous analysis of raw or preprocessed experimental data, such as ribonucleic acid (RNA) sequencing and flow cytometry. Given a user prompt (e.g., “RNA sequencing differential expression analysis”), Finch executes code in a Jupyter notebook to process the data, apply relevant statistical methods, and generate interpretable outputs. For flow cytometry, this includes gating strategies and significance testing, while for RNA sequencing, it encompasses differential expression and gene ontology enrichment analysis. Currently, only these two data types are supported, and expert-designed prompts are still required to ensure reliable results.

Recent work extends agentic systems beyond single-step data evaluation toward executing and optimizing entire workflows. Mandal et al. (2024) introduced AILA (Artificially Intelligent Lab Assistant) utilizing LLM-agents to plan, code, execute, and revise complete atomic force microscopy (AFM) analysis pipelines. The system handles tasks such as image processing, defect detection, clustering, and extraction of physical parameters. Compared to systems like Finch, AILA shifts the focus from generating summaries to performing and improving full experimental analyses with minimal user input while maintaining transparency and reproducibility through code and reports.

5.7.3 Current Limitations

While GPMs offer promising capabilities for automating scientific data analysis, several limitations remain. Recent evaluations such as FMs4Code [Tian et al. (2024)] have shown that even state-of-the-art models like GPT-4-Turbo and Claude 2 frequently produce syntactically correct but semantically incorrect code when tasked with common data analysis steps, such as reading files, applying filters, or generating plots. Typical issues include incorrect column usage or inconsistent output formatting.

These technical shortcomings are reinforced by the model’s sensitivity to prompt formulation. As demonstrated by Yan and He (2020) and Alampara et al. (2024), minor changes in wording or structure can lead to significantly different outputs, highlighting a lack of robustness in prompt-based control.

Together, these findings suggest that while foundation models can generalize across diverse data formats and analysis types, their current performance is not yet sufficient for fully autonomous use in scientific analysis settings. Robust prompting strategies, post-generation validation, and human oversight remain essential components in practice.

5.8 Reporting

To share insights obtained from data analysis, one often converts them into scientific reports. In this step, too, GPMs can take a central role, as we discuss in the following.

Reporting refers to converting scientific results into shareable reports, scientific publications, blogs, and other forms of content. This section describes two main applications of LLMs in scientific reporting: converting data into explanations and the first steps towards using these models as fully-fledged writing assistants.

5.8.1 From Data to Explanation

The lack of explainability of ML predictions generates skepticism among experimental chemists[Wellawatte and Schwaller (2025)], hindering the wider adoption of such models.[Wellawatte, Seshadri, and White (2022)] One promising approach to address this challenge is to convey explanations of model predictions in natural language. An approach proposed by Wellawatte and Schwaller (2025) is to couple LLMs with feature importance analysis tools, such as shapley additive explanations (SHAP) or local interpretable model-agnostic explanations (LIME). In this framework, LLMs can additionally interact with tools such as RAG over arXiv to provide evidence-based explanations.
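
A minimal sketch of this coupling is given below: SHAP attributions for a single prediction are summarized and handed to an LLM to narrate; the property model, feature names, and the `llm_answer` callable are placeholders, and the cited framework additionally retrieves literature evidence to ground the explanation.

```python
import numpy as np
import shap

def explain_prediction(model, X_background, x, feature_names, llm_answer, top_k: int = 5) -> str:
    """Compute SHAP attributions for one sample and ask an LLM to narrate them in plain language.

    `model` is any fitted estimator with a .predict method; `llm_answer` is a placeholder
    callable wrapping an LLM call.
    """
    explainer = shap.Explainer(model.predict, X_background)   # model-agnostic explainer with background data
    attributions = explainer(np.atleast_2d(x)).values[0]
    top = sorted(zip(feature_names, attributions), key=lambda t: abs(t[1]), reverse=True)[:top_k]
    summary = "\n".join(f"{name}: {value:+.3f}" for name, value in top)
    prompt = (
        "The following molecular features contributed most to a predicted property "
        f"(positive values increase the prediction):\n{summary}\n"
        "Explain in plain language, for an experimental chemist, what may have driven this prediction."
    )
    return llm_answer(prompt)
```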

5.8.2 Writing Assistance

When considering ML-based assistance in scientific writing, we can distinguish two primary modes: systems that aid authors during the active writing process and tools that optimize or refine scientific articles after initial drafting.

The former refers to writing copilots that can suggest syntax improvements, identify text redundancies,[Khalifa and Albadawy (2024)] caption figures and tables[Hsu, Giles, and Huang (2021); Selivanov et al. (2023)], or evaluate caption-figure matches[Hsu et al. (2023)], but also handle more specific applications like writing alt-text (descriptive text that explains the meaning and purpose of an image in digital content)[Singh, Wang, and Bragg (2024)].

Under the latter mode, GPMs can be used to assist non-native English speakers with scientific writing [Giglio and Costa (2023)]. They could even allow authors to write in their native language and rely on a GPM to communicate the scientific results in English.

Another application of LLMs is assisting with the completion of checklists before submitting a publication. For example, Goldberg et al. (2024) benchmarked the use of LLMs in completing the author checklist for the Conference on Neural Information Processing Systems (NeurIPS) 2025. They concluded that \(70\%\) of the authors found the LLM assistant useful, with the same fraction indicating they would revise their own checklist based on the model feedback.

5.8.3 Vision

Few have ventured into fully automating the writing process.[Yamada et al. (2025)] While still at its inception, reporting with GPMs has tremendous potential. In Figure 5.7 we showcase how the future of reporting could look if we were to integrate GPMs at each step of the process.

Figure 5.7: Vision for GPMs in reporting: a visualization of the scientific writing process. GPMs can be used at every stage of the process. For creating the pre-print, we can utilize the multimodal capabilities of these models to write detailed captions for figures. For the peer-review process, we can harness the ability of GPMs to summarize and prioritize information (e.g., design a time-efficient plan to address the peer review). When converting a peer-reviewed pre-print into the published version, we often need to implement the publisher’s requirements. In this case, we can make use of agentic systems that would assist with minor text fixes or document restructuring.

An idea entertained by Li et al. (2023) in the context of education is personalized writing. Its goal, making science accessible to everyone, remains widely unexplored. A personalized model that learns user preferences and domain expertise could be used to deliver the message of a scientific article in simpler terms. As a result, we might observe a rise in cross-domain scientific collaborations and a growing interest in science.