7  Implications of GPMs: Education, Safety, and Ethics

The advent of general-purpose models (GPMs) in the chemical sciences marks a paradigm shift that extends beyond methodological advances to fundamentally alter the conceptual frameworks through which scientific knowledge is produced and validated. As these models permeate education and research, their transformative potential is closely linked to critical challenges: cultivating discerning learners, mitigating emergent risks in automated discovery, and navigating ethical dilemmas arising from biased systems. Here, these three implications are examined, and it is argued that responsible integration of GPMs requires not only technical innovation but also rigorous pedagogical and regulatory frameworks.

7.1 Education

7.1.1 Vision

GPMs are opening up new directions to create or use educational materials (see Figure 7.1). Although many current applications are still in their conceptual stage, they begin to highlight the potential of these models to personalize learning, increase the fairness of evaluations, and improve accessibility. GPMs can support both students and educators at various stages of the learning process and across different media forms, indicating a shift toward more adaptive and personalized educational frameworks. [E. R. Mollick et al. (2024)]

For students, GPMs could act as intelligent companions in a variety of tasks. As adaptive tutors, they could tailor explanations, exercises, and feedback to individual needs. Additionally, they could help students rehearse for exams and presentations and deepen their conceptual understanding. [E. Mollick and Mollick (2024); Sharma et al. (2025); J. Wang and Fan (2025)]

Figure 7.1: Possible applications of GPMs in chemical education for students and teachers, and their specific limitations. GPMs can be used by students as scientific assistants (e.g., for data analysis, coding, or lab experiments) and tutors (e.g., for question answering, teaching, or exercises). Teachers can use GPMs for evaluation (e.g., for grading and detailed feedback) or to personalize materials (e.g., for lectures and other learning materials). Current limitations include over-reliance on and a lack of critical assessment of model outputs, a lack of ready-to-use products, hallucination, limited model reliability, and a technology knowledge deficit among users.

In more practice-oriented settings, such as laboratories or coding environments, these models could function as scientific assistants. They may provide real-time feedback on experimental setups, assist with the use of lab equipment, or help identify potential safety concerns.[Du et al. (2024)] Coupled with augmented reality (AR), GPMs could also enable immersive simulations of lab procedures, allowing students to familiarize themselves with workflows and instruments before entering a physical lab. Moreover, by supporting students in technical areas such as coding, data analysis, or simulations, they could lower the entry barrier and flatten the learning curve, thus fostering interdisciplinary competence—particularly in contexts where instructor support is limited.

At the same time, GPMs could ease the workload of educators. They offer new possibilities for generating and adapting course materials—ranging from lecture slides and exercises to individualized exam questions—thus enabling a better alignment with diverse learning levels and prior knowledge. In assessment tasks, these models could support the grading of open-ended responses by providing consistent, criteria-based feedback and reducing subjective bias.[Kortemeyer, Nöhl, and Onishchuk (2024); Gao et al. (2024)] This is especially valuable in large courses or when timely, detailed feedback would otherwise be difficult to provide.
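To make the grading scenario concrete, the following minimal sketch (in Python, assuming access to an OpenAI-compatible chat API; the model name, rubric, and function names are illustrative placeholders rather than part of any system cited in this section) scores an open-ended answer against a fixed rubric and returns structured, criterion-level feedback.

```python
# Minimal sketch of rubric-based grading with a general-purpose LLM.
# Assumptions: an OpenAI-compatible API is available; the model name and
# rubric below are illustrative placeholders, not part of any cited system.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

RUBRIC = {
    "correct_concept": "Identifies that the reaction is exothermic (0-2 points)",
    "justification": "Uses bond enthalpies or Hess's law to justify the sign (0-4 points)",
    "units_and_sign": "Reports the enthalpy change with correct sign and units (0-2 points)",
}

def grade_answer(question: str, student_answer: str) -> dict:
    """Ask the model for criterion-level scores and feedback as JSON."""
    prompt = (
        "You are grading a chemistry exam answer against a fixed rubric.\n"
        f"Question: {question}\n"
        f"Student answer: {student_answer}\n"
        f"Rubric: {json.dumps(RUBRIC, indent=2)}\n"
        "Return JSON with one entry per rubric criterion containing "
        "'score' and 'feedback', plus a 'total' field."
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)
```

In practice, a human grader would still review borderline or low-confidence scores; the value lies in consistent, rubric-anchored feedback rather than fully automated marking.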

7.1.2 Current Status

Despite these envisioned potentials of GPMs in chemistry education, current applications are often still fragmented and lack integration into cohesive educational systems. In many cases, models are used via general-purpose interfaces without subject-specific customization or alignment with curricular goals. Rather than being part of purpose-built tools or platforms, their use remains largely exploratory.

7.1.2.1 General Systems

Early applications often rely on zero-shot prompting of general-purpose large language models (LLMs) or vision language models (VLMs) to aid with student-oriented learning tasks. These include plotting data [Subasinghe, Gersib, and Mankad (2025)], writing code [Tsai, Ong, and Chen (2023)], or generating analogies to explain abstract chemical concepts [Shao et al. (2025)].

Handa et al. (2025) analyzed over 570,000 anonymized Claude.ai conversations from university-affiliated users, finding that students primarily used the model for preparing learning materials (\(39.3\%\)) and solving academic problems (\(33.5\%\)). However, their use was often exploratory and low-stakes, and the study emphasizes the need for guidance, as many users lacked the expertise to evaluate model outputs critically.

Two recent studies have explored zero-shot prompting in more realistic, assessment-focused settings. Baral et al. (2025) introduced a benchmark of more than 2,000 student-drawn math images and found that state-of-the-art VLMs, including GPT-4o and Claude 3.5, struggled to assess student reasoning and correctness, particularly in open-ended or diagram-based answers. Similarly, Kortemeyer, Nöhl, and Onishchuk (2024) investigated the use of GPT-4 for grading handwritten thermodynamics exams at ETH Zürich. While the model performed reasonably well on short derivations, it failed to track detailed rubrics or interpret hand-drawn diagrams reliably. In chemistry education, Kharchenko and Babenko (2024) compared ChatGPT 3.5, Gemini, and Copilot (detailed model names were not specified) on domain-specific tasks and found that while the models performed adequately on simple recall questions, they failed in tasks that required chemical reasoning, structural understanding, or logical analysis.

Together, these findings suggest that while zero-shot prompting enables rapid deployment of GPMs in educational contexts, current models lack the reliability, consistency, and domain grounding required for chemistry education—not only from a pedagogical standpoint, but also from ethical and legal perspectives.

7.1.2.2 Specialized Systems

A growing number of applications embed GPMs in specialized educational systems that combine them with structured components such as retrieval-augmented generation (RAG), learning-progress tracking, or the ability to work with multiple sources of content such as textbooks, lecture slides, or handwritten notes.

For example, Perez et al. (2025) presented a biology question & answer (Q&A) system that used a RAG pipeline to deliver curriculum-aligned answers. The I-Digest team [Jablonka et al. (2023)] introduced a platform that generates lecture summaries and follow-up questions to support continuous learning. Some systems also integrate specialized components to improve personalization and continuity in the learning process. Although not designed for chemistry specifically, general-purpose platforms such as TutorLLM [Li et al. (2025)], which generates personalized content based on learning progress, and LearnMate [X. J. Wang, Lee, and Mutlu (2025)], which creates learning plans and gives feedback, demonstrate how large models can be embedded in structured educational frameworks.
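As a rough illustration of how such a retrieval step can be wired in front of a general-purpose model, the sketch below embeds course materials, retrieves the passages most relevant to a student query, and conditions the answer on them. The embedding model, example passages, and prompt are illustrative assumptions, not the pipeline of any of the systems cited above.

```python
# Minimal retrieval-augmented generation (RAG) sketch for curriculum-aligned Q&A.
# Assumptions: sentence-transformers for embeddings and an OpenAI-compatible
# chat API for generation; the passages and prompt are illustrative only.
import numpy as np
from sentence_transformers import SentenceTransformer
from openai import OpenAI

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder embedding model
client = OpenAI()

# Course materials, pre-split into passages (lecture slides, textbook sections, ...).
passages = [
    "Le Chatelier's principle: a system at equilibrium counteracts imposed changes.",
    "The ideal gas law relates pressure, volume, temperature, and amount: pV = nRT.",
    "A buffer resists pH changes because it contains a weak acid and its conjugate base.",
]
passage_vectors = encoder.encode(passages, normalize_embeddings=True)

def answer(question: str, k: int = 2) -> str:
    """Retrieve the k most similar passages and generate a grounded answer."""
    q_vec = encoder.encode([question], normalize_embeddings=True)[0]
    scores = passage_vectors @ q_vec            # cosine similarity (vectors are normalized)
    top = [passages[i] for i in np.argsort(scores)[::-1][:k]]
    prompt = (
        "Answer the student's question using only the course material below. "
        "If the material is insufficient, say so.\n\n"
        "Course material:\n- " + "\n- ".join(top) + f"\n\nQuestion: {question}"
    )
    reply = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    return reply.choices[0].message.content
```

Grounding the answer in retrieved course material is what enables curriculum alignment and reduces, though it does not eliminate, hallucination.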

Anthropic’s Claude for Education [Anthropic (2025a)] offers a purpose-built platform for higher education with a “Learning Mode” that uses the Socratic method to guide students rather than directly giving the answer to their queries. The intention here is to promote active learning and mitigate the risks of passive tool use. While such systems illustrate the potential of structured GPM-based learning environments, fully integrated applications remain rare—particularly in domain-specific contexts such as chemistry. Furthermore, current models still need greater robustness and stronger domain-specific reasoning before they can be trusted in demanding fields such as chemistry.

7.1.3 Outlook and Limitations

While GPMs offer promising opportunities for chemistry education, their use also raises critical ethical and pedagogical concerns.

A first concern is the lack of transparency in how these models generate responses. Although GPMs produce fluent and plausible explanations, they do so without genuine understanding—and often with no clear indication of uncertainty or possible errors. This can lead students to accept incorrect or misleading information as fact, particularly when the output appears confident or authoritative. Over time, such interactions may normalize uncritical acceptance and discourage students from questioning, verifying, or reflecting on what they are told. In scientific education, where reasoning, skepticism, and an evidence-based mindset are essential, this poses a serious threat to the development of informed and independent learners.[Marcus (2025); Kosmyna et al. (2025)]

A related but distinct issue is the risk of over-reliance and deskilling. When students delegate a large portion of the learning process to generative tools, they might be able to complete assignments without engaging in the mental work needed to develop subject-specific competence. In such cases, GPMs can disrupt the connection between concrete tasks—such as solving a problem or writing an explanation—and the broader educational goals they are meant to support, such as developing chemical understanding or analytical thinking skills.[Dung and Balg (2025); Sharma et al. (2025)]

Today—and even more so in the future—students and learners will increasingly use GPMs, both in education and in their future professions. Attempting to restrict their use is neither realistic nor educationally meaningful. Therefore, educators must guide their use thoughtfully and adapt both the learning process and assessment practices accordingly. Furthermore, the question of what to learn needs to be redefined: learning must increasingly center on the development of critical thinking, creativity, and logical reasoning—skills that remain essential and irreplaceable, even—or perhaps especially—in the age of GPMs. [Klein and Winthrop (2025)]

7.2 Safety

Figure 7.2: A conceptual schematic depicting artificial intelligence (AI) risk factors in the chemical sciences. As one traverses the game-like scientific process, one encounters various obstacles representing AI-exacerbated risks. The path to superaligned chemical AI assistants is obscured by unexplored chemical space.

A growing coalition within the scientific community has sounded a call to action: AI poses existential risks that deserve the same urgent attention as pandemics and nuclear war [“Statement on AI Risk”, CAIS (n.d.)]. This call, formalized in a statement signed by hundreds of prominent AI researchers, reflects mounting recognition that advanced AI systems could fundamentally alter, or threaten, human civilization. While the ongoing discourse has focused on abstract notions of artificial general intelligence (AGI), the immediate risks may emerge through the integration of AI into specific domains where the stakes are already high [Morris et al. (2023)].

Chemistry represents one such domain. The rapid integration of GPMs into chemistry is a double-edged sword. Although these technologies can accelerate discovery, they also introduce unprecedented risks. From democratizing access to hazardous chemical knowledge to enabling autonomous synthesis of dangerous compounds, AI systems could lower barriers for misuse, whether intentional or accidental. GPMs alone may not create new risks [Peppin et al. (2024)], but they can amplify existing ones (see Figure 7.2).

Even in this amplified context, the ability of these systems to pose meaningful safety risks in practice is constrained by real-world limitations, including access to specialized lab equipment, regulated or scarce reagents, and, most critically, the “tacit knowledge”[Polanyi (2009)] required to execute complex chemical processes. Tacit knowledge, the expertise gained through hands-on experience and intuition, cannot be fully acquired from textbooks or datasets alone, as it is typically passed on verbally and consists of small, often seemingly insignificant insights. This gap between theoretical AI outputs and practical execution underscores why the risks, though serious, might remain manageable with proactive safeguards.

Ultimately, mitigating these threats requires a nuanced balance of fostering innovation while embedding safety at the architectural, operational, and governance levels.

7.2.1 Evaluating Risk Amplification in the Chemical Discovery Cycle

7.2.1.1 Dual Use

A critical question is whether LLMs provide maliciously acting novices with new avenues to obtain harmful knowledge beyond what is already easily accessible (e.g., via the internet)[Sandbrink (2023)]. Preliminary research has explored whether LLMs can exacerbate biorisks, a concern that extends analogously to chemical safety under the “information access” threat model [Peppin et al. (2024)]. Urbina et al. (2022) explored how their de novo molecular generator, MegaSyn, could be used to design toxic chemical agents by adjusting the model’s reward function to favor compounds with greater toxicity and bioactivity. While their model predicted VX (a toxic nerve agent) and other chemical warfare agents, the actual synthesis of such compounds requires expertise, controlled precursors, and specialized equipment that are far beyond the capabilities of most non-state actors.

Recent evaluations by OpenAI and Anthropic have systematically assessed how their models could facilitate the creation of biological threats. In an evaluation of their Deep Research System (a multi-agent architecture), OpenAI classified the system as medium-risk for chemical and biological threat creation. [OpenAI (2024)] As discussed previously, a key barrier that prevents such models from exceeding the assessed risk threshold is the acquisition of “tacit knowledge”. However, this system demonstrated modest improvements in troubleshooting and in acquiring tacit knowledge; although it still fell short of expert-level performance, these findings suggest that models are making progress toward overcoming this critical hurdle. OpenAI also evaluated GPT-4’s impact on experts and students across five biological threat-creation stages: ideation, acquisition, magnification, formulation, and release. [OpenAI (2024)] Their key finding was that biorisk information is widely accessible without AI and that practical constraints such as wet-lab access or domain expertise are more limiting than information scarcity.

Anthropic’s parallel assessment of Claude 4 Opus focused specifically on biological risks through red-teaming with bio-defense experts, multi-step agentic evaluations, and explicit testing of bioinformatics tool integration [Anthropic (2025b)]. Their findings align with OpenAI’s assessment of GPT-4 and conclude that current systems remain constrained by physical barriers. Both studies emphasize that control of the chemical supply chain, i.e., the flow of goods (e.g., chemical reagents) from suppliers to consumers, remains crucial as these systems continue to evolve and the barriers to accessing knowledge are continuously lowered. For example, although He et al. (2023) showed that LLMs can generate pathways for explosives like pentaerythritol tetranitrate (PETN) or nerve agents like sarin, the supply chain for precursor chemicals of weapons like sarin is tightly regulated. In addition, access to lab infrastructure such as fume hoods or inert environments is not trivial for non-experts to obtain.

At present, these risks remain mitigated by material and logistical hurdles [Sandbrink (2023)]. Nonetheless, the existing accessibility of such information and the presumed barriers to obtaining materials are not an argument for complacency about AI. Models that lower the technical or cognitive barriers to weaponization, even incrementally, risk acting as force multipliers for malicious actors. Moreover, a red-teaming effort (discussed in Section 4.3.1.4) showed that these practical constraints can be circumvented and that real-world checks are prone to failure.

7.2.1.2 Hallucinations

Another critical risk of GPMs is their propensity for hallucination, leading to factually incorrect outputs.[Pantha et al. (2024); Ji et al. (2023)] These errors risk propagating misinformation, such as inventing non-existent chemical reactions or falsifying safety protocols. Additionally, LLMs suffer from temporal misalignment; their static training data renders outputs obsolete in fast-evolving fields like drug discovery [Pantha et al. (2024)]. Therefore, their accuracy in chemistry decays sharply for research published after the training cutoff, underscoring the need for real-time verification systems.

7.2.1.3 Indirect Cyberattack Risk

The convergence of individual steps of the chemical discovery cycle in autonomous laboratory systems or cloud-based laboratories represents a high-risk scenario [Rouleau and Murugan (2025)]. Beyond traditional cybersecurity threats, these systems face a critical timeline mismatch: AI systems are projected to achieve superhuman hacking capabilities by 2027 while operating within inadequately secured infrastructure [Dean (2025)]. For autonomous laboratories, this means AI systems capable of designing hazardous compounds could be compromised by external actors.

7.2.2 Existing Approaches to Safety

Adversarial testing and red teaming have become prominent methods for evaluating the safety of AI systems, including in the chemical domain (see Chapter 4). While such evaluations are valuable for identifying weaknesses in GPMs, they are inherently reactive: they highlight failures only after a model is trained or deployed, rather than embedding safety into the model’s architecture. Moreover, adversarial testing is often unsystematic and relies on human-curated test cases that may not cover all potential risks, particularly in complex domains such as chemistry.

To move beyond reactive measures, an emerging field of safety research explores machine “unlearning”, a technique that selectively removes hazardous knowledge from a trained model [Barez et al. (2025)]. However, this approach faces significant challenges in chemistry. First, defining “dangerous chemical knowledge” is non-trivial because chemical properties are context-dependent: seemingly benign compounds like bleach can become hazardous when combined or misused. Second, unlearning risks degrading a model’s utility, and the resulting challenge of balancing the trade-off between safety and functionality remains unresolved. This balance is especially delicate for GPMs, which must reconcile broad applicability with strict safety constraints.
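One concrete instance of unlearning is gradient ascent on a curated “forget” set, balanced against a “retain” set precisely because of the utility trade-off noted above. The sketch below is a conceptual illustration, assuming a Hugging Face-style causal language model whose forward pass returns a loss when labels are supplied; the data loaders and the weighting `alpha` are hypothetical, and this is not the method of the work cited here.

```python
# Minimal sketch of gradient-ascent unlearning: push the loss *up* on a curated
# "forget" set of hazardous examples while keeping it low on a "retain" set.
# Assumes a Hugging Face-style model whose forward pass returns `.loss` when
# labels are provided; loaders and the weighting `alpha` are illustrative.
import torch


def unlearning_epoch(model, forget_loader, retain_loader, optimizer, alpha=0.5,
                     device="cuda" if torch.cuda.is_available() else "cpu"):
    model.to(device).train()
    for forget_batch, retain_batch in zip(forget_loader, retain_loader):
        forget_batch = {k: v.to(device) for k, v in forget_batch.items()}
        retain_batch = {k: v.to(device) for k, v in retain_batch.items()}

        optimizer.zero_grad()
        # Gradient ascent on the forget set (negated loss) ...
        forget_loss = model(**forget_batch).loss
        # ... combined with ordinary gradient descent on the retain set,
        # which limits the collateral utility degradation discussed above.
        retain_loss = model(**retain_batch).loss
        (-alpha * forget_loss + (1.0 - alpha) * retain_loss).backward()
        optimizer.step()
```

Deciding what belongs in the forget set is exactly the context-dependence problem described above, and no choice of `alpha` resolves it.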

A more implicit safety strategy is alignment, which aims to steer model behavior toward human values through techniques like reinforcement learning from human feedback (RLHF) or Constitutional AI [Bai et al. (2022)]. Although alignment can reduce harmful outputs, it may not generalize well to novel or domain-specific threats. For instance, a chemically aligned model might refuse to synthesize a known toxin but could still be manipulated into suggesting precursor chemicals. Moreover, even after undergoing alignment training, models are still prone to producing risky and harmful content and remain susceptible to jailbreaks [Kuntz et al. (2025); Yona et al. (2024); Lynch et al. (2025)].

In an effort to train trustworthy models, many AI researchers have turned to “interpretability”, an approach that aims to explain how a GPM’s internal computations are linked to its outputs. [Cunningham et al. (2023)] However, in a recent global evaluation of AI safety, Bengio et al. (2025) argue that state-of-the-art (SOTA) interpretability tools have not yet proven reliable for understanding models well enough to modify them in ways that alleviate safety risks.[Makelov, Lange, and Nanda (2023)]

7.2.2.1 Challenges in Developing Safeguards

Even chemistry-specific LLM agents like ChemCrow [Bran et al. (2024)] or Coscientist [Boiko et al. (2023)] exhibit vulnerabilities. For instance, ChemCrow’s safeguards block known controlled substances, yet He et al. (2023) demonstrated a flaw in its safety protocols: the agent’s refusals are reactive rather than proactive, as they rely on post-query web-search checks rather than embedded safeguards.
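One way to make such refusals proactive is to embed a structure-level check ahead of any tool call, for example by canonicalizing every requested molecule and comparing it against a curated restricted list. The sketch below uses RDKit for canonicalization; the blocklist entry (a harmless placeholder) and the refusal logic are illustrative assumptions, not how ChemCrow or Coscientist actually implement their safeguards.

```python
# Minimal sketch of an *embedded* (pre-query) safeguard: canonicalize any SMILES
# in a request and compare it against a curated blocklist before the agent acts.
# RDKit handles canonicalization; the blocklist and refusal logic are placeholders.
from rdkit import Chem

# In practice this would be a maintained database of controlled or hazardous compounds.
BLOCKLIST_SMILES = {"CC(=O)Oc1ccccc1C(=O)O"}  # harmless placeholder entry (aspirin)
BLOCKLIST_CANONICAL = {
    Chem.MolToSmiles(Chem.MolFromSmiles(s)) for s in BLOCKLIST_SMILES
}

def is_blocked(smiles: str) -> bool:
    """Return True if the molecule matches a blocklisted structure."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return True  # fail closed on unparsable input
    return Chem.MolToSmiles(mol) in BLOCKLIST_CANONICAL

def handle_request(smiles: str) -> str:
    if is_blocked(smiles):
        return "Request refused: target structure is on the restricted list."
    return "Request forwarded to planning tools."
```

Canonicalization catches trivially rewritten SMILES, but it does not defend against precursor-level requests or natural-language circumvention, which is why such a check can only be one layer of a broader safety strategy.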

The capabilities of GPMs, and the risks they pose, need to be contextualized around user intent [Tang et al. (2024)]. Under a paradigm of malicious intent, the user deliberately creates a dangerous situation or application with the tool. However, an uninformed but benign user (e.g., a chemistry undergraduate student) could be unaware of the dangers posed by a given model output and unable to distinguish hallucinated responses from accurate ones.

7.2.3 Solutions

The gaps in current AI safety measures reveal a pressing need for proactive frameworks that address technical and systemic risks. [Bengio et al. (2025)]

7.2.3.1 Regulatory Framework for Chemical AI Models

Drawing from emerging biosecurity governance models, chemical AI oversight should, as a first step, focus on a narrow class of “advanced chemical models” that meet specific risk thresholds. Similar to proposed regulations for biological models, these thresholds could include training on particularly sensitive chemical data that is not widely accessible [Bloomfield et al. (2024)]. This targeted approach is preferable to regulating all chemical AI models because it avoids creating compliance burdens that would disproportionately affect low-risk research while still capturing the systems that actually pose security concerns.

7.2.3.2 Existing Institutional Efforts

Recent governmental initiatives have begun to address AI safety concerns. The US AI Safety Institute [NIST (2024)] and the UK AI Safety Institute have been tasked with designing safety evaluations for frontier models and researching catastrophic risks from AI systems. In contrast, the European Union (EU) has taken a more pragmatic regulatory approach through the EU AI Act [EU (2024)], which classifies AI systems by risk levels and imposes obligations ranging from transparency requirements to prohibited uses. These nascent efforts, while promising, face significant limitations. National institutes operate within frameworks that may prioritize domestic interests over global safety. While more comprehensive in scope, the EU AI Act focuses primarily on general AI applications rather than domain-specific risks such as chemical synthesis, and its risk classification system may not adequately capture the unique dual-use nature of chemical AI models.

7.2.3.3 Future Institutional Oversight and Transparency

A critical step forward is establishing neutral and independent regulatory bodies to oversee AI development. Unlike self-regulation, which risks conflicts of interest, an International Artificial Intelligence Oversight organization (IAIO) composed of AI researchers, policymakers, ethicists, and security experts could harmonize standards and prevent a “race to the bottom” in regulatory laxity [Trager et al. (2023)]. Such a body could mandate pre-approval for high-risk AI research (such as AGI-aligned projects), similar to the institutional review boards that exist in biomedical research [Pistono and Yampolskiy (2016)]. Precedents exist in the European Organization for Nuclear Research (CERN), which balances civilian research with dual-use risks, and in nuclear non-proliferation treaties that tie market access to compliance.[CERN (2024)] Effective governance also requires binding enforcement mechanisms. One approach is conditional market access, under which AI products and precursors can only be traded internationally if certified by the IAIO. Participating states would then enact domestic laws to align with these standards, ensuring corporate and national compliance. This model leverages economic incentives rather than voluntary guidelines to enforce safety. Transparency must also be enforced, particularly in red-teaming and safety testing. Currently, many companies conduct red-teaming privately, with no obligation to disclose their findings or corrective actions. To address this, AI developers should be required to publicly report red-teaming results and undergo third-party safety audits for high-stakes applications.[“Career Update: Google DeepMind -> Anthropic” (2025)]

The risks posed by AI in chemistry demand immediate action from governments and institutions. Yet these challenges also invite a deeper question: Is the integration of AI into scientific discovery fundamentally advancing our understanding or merely accelerating the production of epistemic noise? As Narayanan and Kapoor (2025) cautioned, AI’s predictive abilities often obscure its inability to explain underlying mechanisms. In chemistry, this tension is acute: AI-driven tools may optimize reactions or design toxins with equal ease, but their black-box nature complicates accountability and obscures causal relationships. True progress may require safeguards against misuse and a reevaluation of whether AI’s role in science should be expansive or deliberately constrained.

7.3 Ethics

The deployment of GPMs in the chemical sciences raises several critical ethical concerns that require careful consideration. These issues range from perpetuating harmful biases to environmental impacts and intellectual property concerns [Crawford (2021)].

7.3.1 Environmental Impact and Climate Ethics

The computational requirements for training and deploying GPMs contribute to environmental degradation through excessive energy consumption and carbon emissions. [Spotte-Smith (2025); Board (2023)] These computational resources are often powered by fossil-fuel-based energy sources, which directly contribute to anthropogenic climate change. [Strubell, Ganesh, and McCallum (2019)] The emphasis on AI research has also eclipsed some of the carbon-neutrality commitments made by large technology companies. For example, Google rescinded its commitment to carbon neutrality amid a surge in AI usage and funding (a \(65\%\) increase in carbon emissions between 2021 and 2024).[Bhuiyan (2025)] Additionally, the water consumption for cooling the data centers that support these models is a further concern, particularly in regions facing water scarcity. [Mytton (2021)]

The irony is particularly stark given that, in the chemical sciences, these models are used to address climate-related challenges such as the development of sustainable materials or carbon-capture technologies. As a scientific community, we must grapple with questions about the sustainability of current AI development trajectories and consider more efficient and renewable approaches to model development and deployment. [Kolbert (2024)]

7.3.3 Bias and Discrimination

GPMs inherit and amplify harmful prejudices and stereotypes present in their training data, which poses significant risks when these models are applied translationally in medicinal chemistry and biochemistry. [Spotte-Smith (2025); Yang et al. (2024); Omiye et al. (2023)] These models can perpetuate inaccurate and harmful race- and gender-based assumptions about drug efficacy, toxicity, and disease susceptibility, leading to misdiagnosis and mistreatment. [Chen et al. (2023)] Historical medical literature contains biased representations of how different populations respond to treatments, and GPMs trained on such data can reinforce these misconceptions. [Mittermaier, Raza, and Kvedar (2023)] The problem extends to broader contexts in chemical research: biased models can influence research priorities, funding decisions, and the development of chemical tools in ways that systematically disadvantage the most vulnerable populations [Dotan and Milli (2019)].

7.3.3.1 Solutions

The problem of bias is best addressed through top-down reform. The data necessary to train unbiased models can only exist if clinical studies of drug efficacy are conducted on diverse populations in the real world.[Criado-Perez (2019)] To complement improved data collection, standardized evaluations for bias testing must be developed and mandated prior to the deployment of GPMs.
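As an illustration of what such a standardized evaluation could look like, the sketch below runs a simple counterfactual probe: the same clinical vignette is posed for different demographic groups and the model's urgency ratings are compared. The prompt template, groups, and scoring are hypothetical assumptions; a mandated evaluation would rely on clinically validated cases and expert review.

```python
# Minimal sketch of a counterfactual bias probe: ask the same clinical question
# for different demographic groups and compare the model's answers. The template,
# groups, and model name are illustrative placeholders, not an established benchmark.
from openai import OpenAI

client = OpenAI()

TEMPLATE = (
    "A {group} patient presents with chest pain radiating to the left arm. "
    "Rate the urgency of cardiac workup from 1 (low) to 10 (high). Reply with a number."
)
GROUPS = ["white male", "Black female", "Asian male", "Hispanic female"]

def probe(model: str = "gpt-4o") -> dict:
    """Collect urgency ratings per group; large spreads flag potential bias."""
    ratings = {}
    for group in GROUPS:
        reply = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": TEMPLATE.format(group=group)}],
        )
        ratings[group] = reply.choices[0].message.content.strip()
    return ratings

if __name__ == "__main__":
    print(probe())
```

Systematic spreads in the ratings across groups would flag exactly the kind of translational bias described above and could gate deployment until they are resolved.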

7.3.4 Democratization of Power

Although AI tools have the potential to democratize access to advanced chemical research capabilities, they may also concentrate power in the hands of a few large companies that control the frontier models. This concentration raises concerns about equitable access to research tools, particularly for researchers in smaller institutions with limited resources.[Satariano and Mozur (2025)]