Glossary

alignment
The process of ensuring that a machine-learning model’s behavior is consistent with human intentions, values, or task objectives. In the context of large language models (LLMs), alignment often involves fine-tuning with human feedback to produce helpful, honest, and harmless outputs. Alignment can also refer to aligning representations or embeddings across different modalities—such as text and images—in multi-modal systems, enabling meaningful cross-modal reasoning and retrieval.
agent
An entity that interacts with an environment by taking actions based on observations to achieve a goal. In reinforcement learning (RL), agents learn from feedback to optimize behavior over time. In the context of LLMs, agents can use tools, retrieve external information, and perform multi-step reasoning to complete complex, goal-oriented tasks.
application programming interface (API)
A set of rules and protocols that allow different software systems to communicate with each other. APIs enable developers to access the functionality of external services or applications programmatically, often over a network. For example, the PubChem API allows programmatic access to chemical-compound data, enabling automated retrieval of molecular structures and properties.
benchmark
A standardized task or dataset used to evaluate and compare the performance of models or methods. Benchmarks help assess progress and identify strengths and weaknesses of different approaches.
breadth-first search (BFS)
A graph-traversal algorithm that explores all neighbors of a node before moving to the next level of nodes. It is often used to find the shortest path in unweighted graphs.
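A minimal sketch of shortest-path BFS (pure Python; the example graph is hypothetical):

```python
from collections import deque

def bfs_shortest_path(graph, start, goal):
    """Return a shortest path from start to goal in an unweighted graph."""
    queue, visited = deque([[start]]), {start}
    while queue:
        path = queue.popleft()         # explore nodes level by level
        node = path[-1]
        if node == goal:
            return path
        for neighbor in graph.get(node, []):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(path + [neighbor])
    return None                        # goal unreachable

graph = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
print(bfs_shortest_path(graph, "A", "D"))  # ['A', 'B', 'D']
```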
bilingual evaluation understudy (BLEU)
A metric for evaluating the quality of text generated by a model by comparing it to one or more reference texts. BLEU measures n-gram overlap between the generated output and the reference. Higher BLEU scores indicate closer matches.
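As an illustration, the snippet below computes a sentence-level BLEU score with the NLTK library (assumed installed; the sentences are hypothetical):

```python
from nltk.translate.bleu_score import sentence_bleu

# One (or more) reference texts, plus a candidate, all pre-tokenized.
reference = [["the", "reaction", "yields", "a", "white", "precipitate"]]
candidate = ["the", "reaction", "yields", "a", "pale", "precipitate"]

# The default weights average 1- to 4-gram precisions.
print(f"BLEU: {sentence_bleu(reference, candidate):.3f}")
```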
Bayesian optimization (BO)
A global optimization strategy for expensive black-box functions. It does not require access to derivatives of the objective function. BO builds a probabilistic surrogate model, typically a Gaussian process (GP), to guide the selection of query points by balancing exploration and exploitation.
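A minimal BO loop sketched with scikit-learn's Gaussian-process regressor and an upper-confidence-bound acquisition function (the objective function and search grid are hypothetical stand-ins for an expensive experiment):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def objective(x):                # hypothetical expensive black-box function
    return -(x - 2.0) ** 2

X_grid = np.linspace(0.0, 5.0, 200).reshape(-1, 1)       # candidate query points
X, y = [[0.0], [5.0]], [objective(0.0), objective(5.0)]  # initial evaluations

for _ in range(10):
    gp = GaussianProcessRegressor().fit(X, y)    # probabilistic surrogate
    mu, sigma = gp.predict(X_grid, return_std=True)
    ucb = mu + 2.0 * sigma                       # explore (sigma) + exploit (mu)
    x_next = float(X_grid[np.argmax(ucb), 0])
    X.append([x_next])
    y.append(objective(x_next))

print(max(y))  # best observed value, near the optimum at x = 2
```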
convolutional neural network (CNN)
A type of neural network (NN) designed to process data with a grid-like structure, such as images. CNNs use convolutional layers to automatically learn spatial hierarchies of features from the input.
chain-of-thought (CoT)
A prompting strategy that involves a series of intermediate natural-language reasoning steps that lead to the final output. CoT encourages an LLM to explain its reasoning step by step (e.g., by prompting it to “think step by step”), and it is intended to improve the ability of LLMs to perform complex reasoning.
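For instance, a zero-shot CoT prompt appends the trigger phrase to a question (the question is a hypothetical example):

```python
question = (
    "If 3 mol of H2 react with 1 mol of O2, "
    "how many mol of H2O can form?"
)
prompt = question + "\nLet's think step by step."
# The model is expected to first identify the limiting reagent (O2),
# then apply the 2:1:2 stoichiometry before stating the answer (2 mol).
```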
compositionality
When a complex structure can be constructed by combining simpler components, it is said to have the compositional property. In the context of machine learning (ML), a model or function \(f(x,y)\) exhibits compositionality if it can be constructed from \(h(x)\) and \(k(y)\) as its components: \(f(x,y)=g[h(x),k(y)]\), where \(h\), \(k\), and \(g\) are learned functions. This structure allows modularity, parameter sharing, and generalization to novel input combinations by leveraging the learned behavior of the components. In many deep-learning architectures, compositionality is achieved by the hierarchical application of simpler transformations, such as feed-forward neural networks (FNNs).
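A minimal sketch of this structure, with the learned functions replaced by fixed hypothetical stand-ins:

```python
import numpy as np

def h(x):          # component acting on x (stand-in for a learned function)
    return np.tanh(x)

def k(y):          # component acting on y
    return y ** 2

def g(hx, ky):     # combiner
    return hx + ky

def f(x, y):       # composed model: f(x, y) = g[h(x), k(y)]
    return g(h(x), k(y))

print(f(1.0, 2.0))  # swapping h, k, or g changes f modularly
```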
computational scaling
The way in which the computational cost of an algorithm or model increases with problem size (e.g., the number of atoms, data points, or parameters). Understanding scaling behavior is important for assessing feasibility and performance at larger scales. For example, traditional implementations of density functional theory (DFT) scale as \(\mathcal{O}(N^3)\) with the number of basis functions, making large systems computationally expensive.
context window
The span of input tokens that a model considers at once when generating predictions. In LLMs, the context window defines how much preceding (and possibly surrounding) text the model can attend to. Larger context windows allow the model to capture longer dependencies but increase computational cost.
dry age-related macular degeneration (dAMD)
A chronic eye disease that causes gradual loss of central vision due to the thinning of the macula. dAMD is the more common and less severe form of age-related macular degeneration, typically progressing slowly over time.
data augmentation
A technique used to artificially increase the size and diversity of a training dataset by applying transformations or generating variations of the original data. Data augmentation helps improve model generalization and robustness.
data chunk
A contiguous block or segment of data treated as a unit for processing, storage, or transmission. Data chunks are used to divide large datasets into manageable parts—for example, when streaming text or segmenting long sequences. Unlike a batch, which typically refers to a set of independent samples processed together in training, a data chunk often preserves sequential or structural continuity within a single sample.
de novo design
The process of generating novel molecules or materials from scratch, guided by desired properties or objectives rather than by modifying known structures.
dense/sparse vector
Two categories of vector representations used in machine learning (ML) and data processing. Dense vectors have most or all elements non-zero and are typically used for learned embeddings. Sparse vectors contain mostly zero values and are common in high-dimensional representations such as one-hot encodings (OHEs). Sparse vectors are more memory-efficient when stored in specialized formats, while dense vectors are preferred for neural computation, which relies on hardware-optimized dense matrix operations.
depth-first search (DFS)
A graph-traversal algorithm that explores as far as possible along each branch before backtracking. DFS is often used for pathfinding, cycle detection, and analyzing graph structures.
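A minimal recursive sketch (pure Python; the example graph is hypothetical):

```python
def dfs(graph, node, visited=None):
    """Visit nodes depth-first, going as deep as possible before backtracking."""
    if visited is None:
        visited = []
    visited.append(node)
    for neighbor in graph.get(node, []):
        if neighbor not in visited:
            dfs(graph, neighbor, visited)
    return visited

graph = {"A": ["B", "C"], "B": ["D"], "C": [], "D": []}
print(dfs(graph, "A"))  # ['A', 'B', 'D', 'C']: D is reached before backtracking to C
```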
deep learning (DL)
A subfield of machine learning (ML) that uses neural networks (NNs) with many layers to model complex patterns in data. Deep learning has enabled major advances in image recognition, natural-language processing, and molecular modeling.
weight-decomposed low-rank adaptation (DoRA)
An extension of low-rank adaptation (LoRA) that decomposes each pre-trained weight into magnitude and direction components. DoRA fine-tunes the magnitude as a separate trainable parameter while adapting the direction with a LoRA-style low-rank update, improving stability and performance over standard LoRA.
direct preference optimization (DPO)
A method for fine-tuning LLMs based on human preference data without using reinforcement learning. DPO directly optimizes the model (policy) to prefer outputs ranked higher by human annotators, simplifying the reinforcement learning from human feedback (RLHF) pipeline.
domain-specific language (DSL)
A programming language tailored to a particular application domain. DSLs offer specialized syntax and abstractions that make it easier to express solutions within that domain, such as chemical synthesis.
evolutionary algorithm (EA)
A family of optimization algorithms inspired by natural selection. EAs evolve a population of candidate solutions over multiple generations using operations such as mutation, crossover, and score-based selection to find high-performing solutions.
extended connectivity fingerprint (ECFP)
A type of molecular fingerprint that represents chemical structures as binary or count vectors based on local atomic environments. ECFPs are generated by iteratively hashing the neighborhoods of each atom up to a given radius, capturing information about connectivity and substructures. These fingerprints are invariant to atom indexing and are commonly used in machine-learning pipelines for tasks such as virtual screening, molecular similarity, and property modeling. A widely used variant is ECFP4, which uses a radius of 2.
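A short sketch of generating an ECFP4 fingerprint, assuming the RDKit library is available:

```python
from rdkit import Chem
from rdkit.Chem import AllChem

mol = Chem.MolFromSmiles("CCO")  # ethanol
# radius=2 corresponds to ECFP4 (diameter 4); folded to a 2048-bit binary vector.
fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=2048)
print(fp.GetNumOnBits(), "bits set out of", fp.GetNumBits())
```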
embedding
A representation of discrete or high-dimensional data in a continuous vector space that preserves relevant relationships or structure. Embeddings are commonly used for words, molecules, graphs, and other symbolic data.
energy rank alignment (ERA)
A method for aligning the training of generative models with energy-based evaluations. ERA encourages the model to assign higher probabilities to lower-energy (more favorable) configurations by matching the model’s ranking of samples to their energy scores.
F\(_1\) score
A performance metric for classification tasks that balances precision and recall. It is defined as the harmonic mean of precision and recall:
\(F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}\).
The F\(_1\) score ranges from 0 to 1. Because the harmonic mean is dominated by the smaller of the two numbers, a high F\(_1\) score means that both precision and recall are high, making this metric useful when classes are imbalanced or when both false positives and false negatives are important.
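A worked example from hypothetical confusion-matrix counts:

```python
tp, fp, fn = 8, 2, 4                 # hypothetical counts
precision = tp / (tp + fp)           # 0.800
recall = tp / (tp + fn)              # 0.667
f1 = 2 * precision * recall / (precision + recall)
print(f"{f1:.3f}")                   # 0.727, pulled toward the lower of the two
```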
fine-tuning
The process of taking a pre-trained model and continuing its training on a smaller, task-specific dataset to adapt it to a particular application. Fine-tuning updates the model parameters to specialize its behavior while retaining the general knowledge learned during pre-training.
feed-forward neural network (FNN)
The simplest class of neural networks in which information flows in one direction from input to output through a series of connected layers, without cycles or feedback connections.
fused representation
A joint representation that integrates information from multiple modalities into a single embedding space. See latent fusion.
gating mechanism
A neural network component that controls the flow of information by modulating one signal using another, typically via element-wise multiplication. After computing a gate vector \(g(x)\), the gating mechanism applies it to an input signal \(z\) as \(z' = g(x) \odot z\), where \(\odot\) denotes element-wise multiplication. This allows the model to selectively suppress or pass through different components of \(z\) based on the learned gating function \(g\). Gating mechanisms are central to architectures like LSTMs, where they regulate memory updates, retention, and output generation.
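A minimal numeric sketch of \(z' = g(x) \odot z\) in numpy, with a sigmoid of a linear map as a hypothetical stand-in for the learned gate:

```python
import numpy as np

rng = np.random.default_rng(0)
W, b = rng.normal(size=(4, 4)), np.zeros(4)    # hypothetical learned parameters

def gate(x):
    return 1.0 / (1.0 + np.exp(-(W @ x + b)))  # sigmoid keeps gate values in (0, 1)

x = rng.normal(size=4)    # controlling signal
z = rng.normal(size=4)    # signal to be gated
print(gate(x) * z)        # element-wise product suppresses or passes each component
```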
genetic algorithm (GA)
This subset of evolutionary algorithms models candidate solutions as individuals represented by strings of “genes” (e.g., binary digits or symbols). Unlike broader evolutionary algorithms, GAs focus heavily on genetic representations and recombination to explore the search space.
graph neural network (GNN)
A type of neural network architecture designed to operate on graph-structured data. GNNs learn representations by passing and aggregating information between neighboring nodes, making them well-suited for tasks involving chemical structures, such as molecules and crystals.
general-purpose model (GPM)
A model that is designed to generalize across a wide range of tasks and domains with minimal task-specific modifications. General-purpose models are typically pre-trained on vast, diverse datasets using self-supervised objectives. They can be efficiently adapted to new tasks via prompting or fine-tuning. Examples include architectures such as large language models and vision-language models.
Gaussian process (GP)
A non-parametric probabilistic model used to define a distribution over functions. It is commonly used in Bayesian optimization and regression to make predictions with uncertainty estimates.
hallucination
The phenomenon where a model generates output that seems plausible but is factually incorrect or unsupported by the input or training data. Hallucinations are common in large language models and pose challenges for applications requiring reliability and factual accuracy.
hierarchical navigable small world (HNSW)
An efficient algorithm for approximate nearest-neighbor search in high-dimensional spaces. It builds a graph-based data structure with multiple layers of navigable small-world graphs, where most nodes can be reached in a few steps through well-connected hubs. This structure allows fast and scalable similarity search. In chemistry, HNSW is often used for rapid retrieval of structurally similar molecules in large compound libraries.
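A sketch of approximate nearest-neighbor search with the hnswlib package (assumed installed; the random vectors are stand-ins for molecular embeddings):

```python
import numpy as np
import hnswlib

dim, n = 128, 10_000
data = np.random.rand(n, dim).astype(np.float32)   # stand-in embeddings

index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=n, ef_construction=200, M=16)
index.add_items(data, np.arange(n))

labels, distances = index.knn_query(data[:1], k=5)  # 5 nearest neighbors of item 0
print(labels, distances)
```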
in-context learning (ICL)
In this method, large language models are adapted to perform new tasks at inference time by conditioning them on examples provided directly in the input prompt, without updating their parameters. The model uses patterns inferred from these examples to predict or generate appropriate outputs for new inputs.
international chemical identifier (InChI)
A textual representation for chemical substances developed by IUPAC, designed to provide a standard and machine-readable representation of molecular structures. InChI encodes information such as connectivity, hydrogen atoms, stereochemistry, and isotopes in a structured sequence of layers. The general format is InChI=version/layers.
inductive bias
The set of assumptions a learning algorithm uses to generalize from limited training data to unseen examples. Inductive bias guides which solutions a model is likely to prefer and can arise from model architecture, input representations, or training objectives. Hard inductive biases are built into the structure of the model and define the solution space, while soft inductive biases, which nudge the model to different parts of the solution space, are encouraged by training choices or priors but can be overridden by the data.
inference
In the context of large language models, inference is the computational process by which a trained model produces output tokens given an input sequence. Inference involves executing a forward pass through the model to evaluate the conditional probability distribution over possible next tokens, and then sampling from that distribution using a decoding strategy. Unlike training, inference does not involve gradient computation or parameter updates, and is often optimized for speed and throughput.
item response theory (IRT)
This statistical framework was originally developed in psychometrics to model the relationship between a person’s latent ability and their probability of correctly answering test items. In machine learning, IRT is used to analyze model performance by treating test examples as items and estimating their difficulty, allowing for more nuanced evaluation than aggregate accuracy.
latent fusion
A method for integrating information from multiple modalities by combining their representations in a shared latent space. Latent fusion enables models to reason jointly over heterogeneous data, such as combining text and molecular structure embeddings for property prediction or generation tasks.
latent space
An abstract, typically lower-dimensional space where input data is represented after being transformed by a model. Latent spaces often aim to capture meaningful features or structures that are not explicitly present in the original data.
latent state
An internal, unobserved representation maintained by a model to capture relevant information about the input or sequence history. Latent states are used in models like recurrent neural networks and state-space models.
language-interfaced fine-tuning (LIFT)
A method for fine-tuning models by framing structured tasks such as classification or regression as natural-language prompts. LIFT enables the use of large language models for supervised tasks without modifying the model architecture, leveraging prompt-based interfaces to adapt to diverse formats.
large language model (LLM)
A language model that has a large number of parameters, usually on the scale of billions. For example, Llama 3 and GPT-3 contain 70 B and 175 B parameters, respectively, while GPT-4 and Claude 3 Opus are estimated to have 1.76 T and 2 T parameters. Most current large language models are based on the transformer architecture. LLMs can perform many language tasks, such as generating human-like text, understanding context, translation, summarization, and question answering.
language model (LM)
A model that estimates the probability of a token or sequence of tokens occurring in a longer sequence of tokens. This probability is used to predict the most likely next token based on the previous sequence. Language models are trained on large datasets of text, learning the patterns and structures of language to understand, interpret, and generate natural language.
local-env
A textual representation of crystal structures, inspired by Pauling’s rule of parsimony and designed to leverage the structural redundancy often found in the local atomic arrangement of crystals. It begins with the crystal’s space group, followed by a list of distinct coordination environments, each specified by its Wyckoff label and a corresponding SMILES string.
low-rank adaptation (LoRA)
A parameter-efficient fine-tuning technique that freezes the pre-trained model weights and decomposes the weight update into the product of two low-rank matrices, so that only a small number of additional parameters is optimized during fine-tuning.
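A minimal numeric sketch of the low-rank update \(W' = W + BA\) in numpy (dimensions are hypothetical):

```python
import numpy as np

d, r = 512, 8                        # hidden size and low rank, r << d
W = np.random.randn(d, d)            # frozen pre-trained weight
A = np.random.randn(r, d) * 0.01     # trainable r x d matrix
B = np.zeros((d, r))                 # trainable d x r matrix, initialized to zero

W_adapted = W + B @ A                # effective weight; only A and B are trained
print(A.size + B.size, "trainable vs", W.size, "frozen parameters")  # 8192 vs 262144
```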
loss function
A mathematical function that quantifies the difference between a model's predictions and the true targets. Training typically minimizes the loss, which guides the optimization process by indicating how well the model is performing.
long short-term memory (LSTM)
A type of recurrent neural network architecture designed to capture long-range dependencies in sequential data. LSTMs use gating mechanisms to control the flow of information, making them effective for tasks like language modeling, time-series prediction, and speech recognition.
Monte Carlo tree search (MCTS)
A heuristic search algorithm used for decision-making in sequential environments, particularly in games and planning problems. MCTS incrementally builds a search tree by simulating many random playouts from different states to estimate action values. It typically proceeds through four phases: selection, expansion, simulation, and backpropagation. MCTS has been notably used in systems like AlphaGo for planning in high-dimensional, sparse-reward environments.
machine learning (ML)
This field of computer science is focused on developing algorithms that enable systems to learn patterns from data and make predictions or decisions without being explicitly programmed.
machine-learning interatomic potential (MLIP)
A model that approximates the potential energy surface of atomic systems using machine learning techniques. MLIPs are typically trained on reference data from quantum-mechanical calculations such as density functional theory (DFT), learning to predict both energies and forces. They serve as a data-driven alternative to classical force fields, enabling more accurate and transferable atomistic simulations at a fraction of the cost of ab initio methods.
modality
A form of data characterized by a particular sensory or representational channel, such as text, images, audio, or molecular graphs. In machine learning, handling multiple modalities enables models to integrate and reason across diverse data sources.
model temperature
The term originates from statistical physics, where temperature controls the entropy of a system. In the context of machine learning, temperature is a parameter used during sampling from a probabilistic model to control the randomness of the output distribution. Lower temperatures sharpen the distribution, making outputs more deterministic, while higher temperatures flatten it, increasing diversity. In large language models (LLMs), temperature is commonly tuned during text generation to balance creativity and coherence.
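A minimal sketch of temperature scaling before sampling (numpy; the logits are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(logits, temperature):
    scaled = np.asarray(logits) / temperature
    probs = np.exp(scaled - scaled.max())    # numerically stable softmax
    probs /= probs.sum()
    return rng.choice(len(probs), p=probs)

logits = [2.0, 1.0, 0.1]                     # hypothetical next-token scores
# Low T approaches argmax (deterministic); high T approaches uniform (diverse).
for T in (0.1, 1.0, 10.0):
    print(T, [int(sample(logits, T)) for _ in range(8)])
```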
mixture of experts (MoE)
A general modeling framework in which multiple specialized sub-models, or experts, are combined using a function that selects or weights their contributions based on the input.
message-passing neural network (MPNN)
A class of graph neural networks (GNNs) that operate on graphs by iteratively updating node representations through the exchange of messages with neighboring nodes. MPNNs are widely used in molecular property prediction and other graph-structured tasks.
n-gram
A contiguous sequence of n tokens from a given text. N-grams are commonly used in language modeling and text analysis.
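For example, extracting bigrams from a tokenized sentence (pure Python):

```python
def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the cat sat on the mat".split()
print(ngrams(tokens, 2))  # [('the', 'cat'), ('cat', 'sat'), ('sat', 'on'), ...]
```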
natural language processing (NLP)
A subfield of computer science that uses machine learning to enable computers to process and generate human language. Primary tasks in NLP include speech recognition, text classification, natural language understanding, and natural language generation.
neural network (NN)
One of the most prevalent model components used in deep learning, inspired by the structure of the brain. Neural networks are made up of layers of connected units called neurons. Each layer takes a vector x as input, performs a simple calculation f(x), and passes the result to the next layer. A typical layer can be represented as f(x) = σ(Wx + b), where W is a matrix of learned weights, b a learned bias term, and σ a non-linear activation function. By stacking many such layers, neural networks can approximate complex functions. See also: compositionality.
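A single layer of this form, sketched in numpy with a ReLU as the non-linearity (the parameters are random stand-ins for learned values):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))            # learned weights: 4 inputs -> 3 outputs
b = rng.normal(size=3)                 # learned bias

def layer(x):
    return np.maximum(0.0, W @ x + b)  # sigma = ReLU activation

x = rng.normal(size=4)
print(layer(x))                        # stacking such layers yields a deep network
```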
optical character recognition (OCR)
A technique used to identify and convert images of printed or handwritten text into a machine-readable format. This involves segmentation of text regions, character recognition, and post-processing to correct errors and enhance accuracy.
one-hot encoding (OHE)
A method for representing categorical variables as binary vectors, where each category is assigned a unique position set to 1 and all others are 0. One-hot encoding is commonly used to input discrete features into machine-learning models that require numerical input.
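For example, encoding three atom types (numpy; the category set is hypothetical):

```python
import numpy as np

categories = ["C", "N", "O"]

def one_hot(value):
    vec = np.zeros(len(categories))
    vec[categories.index(value)] = 1.0
    return vec

print(one_hot("N"))  # [0. 1. 0.]
```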
operating system (OS)
Software that manages computer resources, providing an interface between hardware and software. The operating system handles tasks such as memory management, process scheduling, and device control.
planning domain definition language (PDDL)
A formal language used to specify planning problems in automated planning. PDDL defines the initial state, goal conditions, and available actions with their preconditions and effects, enabling planners to generate sequences of actions to achieve the desired outcomes. It has been applied to domains such as robotic control and retrosynthetic planning.
parameter-efficient fine-tuning (PEFT)
A methodology to efficiently fine-tune large pre-trained models without modifying their original parameters. PEFT strategies adjust only a small number of additional parameters during fine-tuning on a new, smaller dataset, significantly reducing computational and storage costs while achieving performance comparable to full fine-tuning. Common PEFT methods include low-rank adaptation (LoRA), quantized LoRA (QLoRA), and weight-decomposed low-rank adaptation (DoRA).
policy
In reinforcement learning, a policy defines an agent’s behavior by mapping states to actions, typically denoted as π(a | s). A policy can be deterministic or stochastic and may be represented by a parameterized function such as a neural network. The goal of reinforcement learning (RL) is to learn a policy that maximizes expected cumulative reward.
proximal policy optimization (PPO)
A reinforcement-learning (RL) algorithm used to train large language models (LLMs) for alignment. PPO adjusts the policy parameters while keeping changes within a predefined safe range to maintain stability and improve learning efficiency. PPO is often used as part of reinforcement learning from human feedback (RLHF).
prompting
The practice of providing input to a large language model (LLM) in the form of natural-language instructions, questions, or examples to guide its behavior. Prompting can influence the model’s responses without changing its parameters.
retrieval augmented generation (RAG)
A technique for improving text generation by providing large language models (LLMs) with access to information retrieved from external knowledge sources. In practice, relevant retrieved text snippets are added to the prompt.
regular expression (regex)
A sequence of characters that defines a search pattern for matching text. Regular expressions are widely used for tasks such as string validation, extraction, and substitution.
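For example, a simplified pattern that pulls element symbols and counts out of a molecular formula with Python's re module:

```python
import re

# An uppercase letter, an optional lowercase letter, then an optional count.
pattern = r"([A-Z][a-z]?)(\d*)"
print(re.findall(pattern, "C6H12O6"))
# [('C', '6'), ('H', '12'), ('O', '6')]
```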
reinforcement learning (RL)
In this machine-learning paradigm, an agent learns to take actions in an environment to maximize cumulative reward through trial and error. See also: policy.
reinforcement learning from human feedback (RLHF)
A mechanism that uses reinforcement learning (RL) to align large language models (LLMs) with user preferences by fine-tuning on human feedback. Users rate the quality of model responses; a preference model is trained on these ratings and then used in an RL setup to optimize the LLM’s generations.
recurrent neural network (RNN)
A type of neural network designed for processing sequential data by maintaining a hidden state that captures information from previous time steps. Without special mechanisms, RNNs struggle to capture long-range dependencies.
receiver operating characteristic—area under the curve (ROC-AUC)
A performance metric for binary classifiers representing the area under the ROC curve, which plots true-positive rate against false-positive rate at various thresholds. A higher ROC-AUC indicates better class-distinguishing ability, with 1.0 being perfect and 0.5 representing random guessing.
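For example, with scikit-learn (the labels and scores are hypothetical):

```python
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1, 1, 0]                 # binary ground-truth labels
y_score = [0.1, 0.4, 0.35, 0.8, 0.7, 0.2]   # predicted probabilities
print(roc_auc_score(y_true, y_score))       # ~0.89
```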
semantic search
Rather than retrieving text based on exact keyword matching, semantic search is a technique that retrieves information based on meaning. Semantic search uses vector representations (see embedding) of queries and documents to capture contextual similarity, enabling more accurate retrieval of relevant results even when different words are used.
self-referencing embedded strings (SELFIES)
A textual representation of molecular structures designed to be 100% robust, meaning every SELFIES string maps to a valid molecule. Unlike SMILES, SELFIES uses a grammar-based system to encode molecular structures in a way that prevents syntactic invalidity, making it especially useful for generative models.
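A round-trip sketch with the selfies package (assumed installed):

```python
import selfies as sf

smiles = "C1=CC=CC=C1"        # benzene
s = sf.encoder(smiles)        # SMILES -> SELFIES
print(s)
print(sf.decoder(s))          # SELFIES -> SMILES; decoding always yields a valid molecule
```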
simplified line-input crystal-encoding system (SLICES)
A string-based representation of crystal structures in a human-readable and machine-interpretable form. SLICES aims to facilitate generative modeling in materials science by representing crystalline structures as linear sequences amenable to sequence-based models.
supervised fine-tuning (SFT)
One of the simplest forms of the fine-tuning process, in which a pre-trained LLM is fine-tuned on a smaller, labeled dataset for a specific task.
simplified molecular input line entry system (SMILES)
A non-unique textual representation of molecular structures, often small organic molecules, using a sequence of ASCII characters. SMILES encodes atoms and bonds in a linear form suitable for storage, search, and ML applications. The term canonical SMILES refers to a SMILES string that is generated using a deterministic algorithm (e.g., by RDKit) to ensure consistency.
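For example, canonicalization with RDKit (assumed installed) maps different valid SMILES of the same molecule to one string:

```python
from rdkit import Chem

# Two non-canonical SMILES for ethanol.
for smi in ("OCC", "C(O)C"):
    print(Chem.MolToSmiles(Chem.MolFromSmiles(smi)))  # both print "CCO"
```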
state-of-the-art (SOTA)
In the context of ML, this term is used to describe the most advanced models or techniques that represent the highest level of performance achievable today. They are typically the result of extensive research and development and serve as benchmarks for researchers and developers.
state space
The set of all possible internal configurations (states) a system or model can occupy. In ML, state spaces are used to model sequential behavior, with transitions describing how the system evolves over time.
supervised learning
In this ML paradigm, the model is trained on labeled data, learning to map inputs to known outputs.
self-supervised learning (SSL)
An ML technique that involves generating labels from the input data itself instead of relying on external labeled data. It has been foundational for the success of LLMs, as their pre-training task (next-word prediction or filling-in of masked words) is a self-supervised task.
selective state space model (SSM)
A neural architecture that models sequential data using latent states and learned transitions, while selectively controlling which components of the state are updated at each step. SSMs aim to improve long-range reasoning and efficiency, and are used as alternatives to attention mechanisms in tasks like language modeling.
token
A basic unit of text used by LMs, typically corresponding to a word, subword, or character depending on the tokenizer. Models process and generate text as sequences of tokens.
transferability
The degree to which knowledge learned in one setting—such as a task, domain, or data distribution—can be effectively reused or adapted in another. High transferability indicates that representations or models generalize well to new contexts, and it is a key objective in transfer learning, foundation models, and general-purpose architectures.
transfer learning
In this ML paradigm, knowledge gained from one task or domain is reused to improve performance on a different, often related, task. Transfer learning typically involves pre-training a model on a large dataset and then fine-tuning it on a smaller, task-specific dataset, making it particularly valuable in low-data settings.
tree-of-thought (ToT)
This prompting and reasoning framework extends chain-of-thought by exploring multiple reasoning paths in a tree structure. ToT allows a model to evaluate and revise intermediate steps, enabling more deliberate decision-making in complex tasks.
unsupervised learning
In this ML paradigm, the model is trained on unlabeled data to discover hidden patterns or structure, such as clusters or latent variables.
vision-language model (VLM)
A multi-modal model that jointly processes images and text, typically generating text outputs conditioned on both modalities.