AI Tools Glossary 2026: Every Term You Actually Need to Know
The AI space moves fast, and the vocabulary moves even faster. Every week there's a new acronym, a new architecture, or a new product category with its own jargon. This glossary covers the terms that keep coming up when you're reading about AI tools, building with language models, or trying to understand what vendors are actually selling.
Definitions are kept practical. If a term has a common misuse, that's noted too. Cross-links go to specific products or deeper guides where relevant.
A
Agent (AI Agent)
Software that uses a language model to take multi-step actions toward a goal. Unlike a chatbot that generates one reply and stops, an agent decides what to do next, calls tools, reads the results, and keeps going until the task is done. See the full AI agent explainer for the anatomy in detail.
Alignment
The process of making an AI system's behavior match intended human values. A well-aligned model should be helpful, honest, and avoid causing harm. Alignment is an active research area because large models can produce plausible-sounding but harmful or misleading outputs that basic training doesn't catch.
Attention Mechanism
The part of a transformer model that decides which tokens in the input are most relevant to each other. Attention is what lets a model understand that "it" in "the cat sat on the mat because it was tired" refers to the cat, not the mat. The quality of a model's attention architecture directly shapes how well it handles long documents and multi-step reasoning.
Autoregressive Model
A model that generates output one token at a time, where each new token is predicted based on all the tokens that came before it. GPT-series models, Claude, and most modern chat models are autoregressive. The implication: the model can't go back and revise an earlier token once it's generated, which is why output quality sometimes degrades over long completions.
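As a rough illustration of the one-token-at-a-time loop, here is a minimal sketch. The lookup table NEXT_TOKEN is a hypothetical stand-in for a real model, and it only looks at the last token, whereas a real model conditions on the whole prefix.

```python
# Toy next-token table standing in for a real model (hypothetical data).
NEXT_TOKEN = {
    "the": "cat",
    "cat": "sat",
    "sat": "down",
    "down": "<eos>",
}

def generate(prompt_tokens, max_new_tokens=10):
    """Greedy autoregressive decoding: append one token at a time."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        # A real model would score every vocabulary entry given the full prefix.
        next_token = NEXT_TOKEN.get(tokens[-1], "<eos>")
        if next_token == "<eos>":
            break
        tokens.append(next_token)  # earlier tokens are never revised
    return tokens

print(generate(["the"]))
```

Note that the loop only ever appends; nothing in it can go back and change a token that has already been emitted.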
B
Base Model
A language model trained on a large text corpus without any instruction tuning or alignment adjustments. Base models are good at completing text in the style of their training data, but not at following instructions or acting as assistants. Most deployed models are fine-tuned versions of a base model. The base model is the foundation.
Batch Inference
Running many model requests simultaneously rather than one at a time, typically to reduce cost. Providers like Anthropic and OpenAI offer batch APIs where you submit a list of prompts and get results back asynchronously, often at 50% lower cost than synchronous requests. Useful for processing large datasets where real-time response isn't needed.
Bias (Model Bias)
Systematic errors in model outputs that often reflect patterns in training data. A model trained mostly on English text from Western sources will perform worse on other languages and may encode cultural assumptions that don't generalize. Bias isn't always obvious from testing; it tends to surface at the edges of the distribution, in rare inputs, or for specific demographic groups.
BM25
A classic text retrieval algorithm based on term frequency and document length. It is not a neural approach; it relies on keyword statistics. BM25 is still widely used in RAG pipelines for hybrid search, where you combine it with semantic vector search to catch exact phrase matches that embedding-based retrieval sometimes misses.
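To make the "keyword statistics" concrete, here is a small sketch of BM25 scoring in plain Python. The whitespace tokenizer, the sample documents, and the parameter values k1=1.5 and b=0.75 are assumptions chosen for illustration; production search engines use their own tokenization and tuning.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document in `docs` against `query` with Okapi BM25."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(d) for d in tokenized) / n_docs
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for d in tokenized for term in set(d))

    scores = []
    for d in tokenized:
        term_freq = Counter(d)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            tf = term_freq[term]
            # Longer-than-average documents are penalized via the b term.
            norm = tf + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores

docs = [
    "refund policy for returned items",
    "shipping times and delivery options",
    "how to return an item for a refund",
]
scores = bm25_scores("refund policy", docs)
best = max(range(len(docs)), key=scores.__getitem__)
print(best, [round(s, 3) for s in scores])
```

The document that contains both query terms ranks first, and a document with no matching keywords scores zero, which is exactly the exact-match behavior that hybrid search borrows from BM25.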
C
Chain-of-Thought (CoT)
A prompting technique where you ask a model to show its reasoning step by step before giving a final answer. Models that reason out loud tend to make fewer errors on multi-step problems than models that jump straight to a conclusion. "Think step by step" in a prompt is the blunt version of CoT. More structured implementations walk the model through a specific reasoning format.
Context Window
The maximum amount of text a model can process in a single request, including both your input and its output. Context windows are measured in tokens. In 2026, production models range from 128K tokens (GPT-4o Mini) to 1 million tokens (Gemini 2.5 Pro). A larger context window means the model can read entire books, large codebases, or long conversation histories at once. See the context window explainer for current sizes.
ControlNet
An architecture that adds spatial control to image diffusion models. Where a standard text prompt influences the general composition of an image, ControlNet lets you constrain the output to match a specific pose, edge map, depth map, or line drawing. Used heavily in workflows combining Stable Diffusion with precise compositional control.
CUDA
NVIDIA's parallel computing platform, used to run neural network computations on GPUs. A "CUDA out of memory" error means the model and its working data need more memory than your GPU's VRAM has free. Most local AI model runners (Ollama, llama.cpp) depend on CUDA for GPU acceleration on NVIDIA hardware.
D
Diffusion Model
A class of generative model trained by learning to reverse a noise process. During training, the model learns to denoise progressively noisier versions of images. At inference time, it starts from random noise and iteratively removes noise guided by a text prompt, arriving at a coherent image. Diffusion models are the engine behind Stable Diffusion, DALL-E, Midjourney, Flux, and Adobe Firefly.
DORA Metrics
Software delivery performance metrics: Deployment Frequency, Lead Time for Changes, Change Failure Rate, and Time to Restore Service. Not strictly an AI term, but AI coding agents like Claude Code and Devin are often evaluated on their impact on an engineering team's DORA metrics.
E
Embedding
A numerical vector representation of a piece of text. Words or sentences with similar meaning produce vectors that are close together in the vector space, which is what makes semantic search possible. You run a query through the same embedding model used to index your documents, then find the stored vectors nearest to the query vector. The quality of the embedding model determines how well "Can I return this?" matches a passage about your refund policy. Central to every RAG pipeline.
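The nearest-vector lookup can be sketched in a few lines. The three-dimensional vectors below are made-up stand-ins for real embeddings (which typically have hundreds or thousands of dimensions), and query_vector is a pretend embedding of "Can I return this?".

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical index: passage text mapped to its (toy) embedding vector.
index = {
    "Refunds are accepted within 30 days of purchase.": [0.9, 0.1, 0.2],
    "Our office is closed on public holidays.": [0.1, 0.8, 0.3],
}
query_vector = [0.85, 0.15, 0.25]  # pretend embedding of "Can I return this?"

best_passage = max(index, key=lambda text: cosine_similarity(query_vector, index[text]))
print(best_passage)
```

In a real pipeline both the passages and the query would be passed through the same embedding model, and a vector database would replace the linear scan.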
Evaluations (Evals)
Tests used to measure model performance on specific tasks or behaviors. A well-designed eval set lets you compare models, catch regressions after fine-tuning, and measure whether a prompt change actually improved outputs. Running evals before deploying a change is the equivalent of running unit tests before shipping code. The AI field has been building standardized eval benchmarks (MMLU, HumanEval, GPQA) to enable consistent comparisons.
F
Fine-Tuning
Continuing the training of a pre-trained model on a smaller, task-specific dataset. Fine-tuning adjusts the model's weights to make it better at a specific task (customer support, medical note summarization, code in a specific language) without training from scratch. It costs far less than pretraining, and the resulting model reflects the style and content of the fine-tuning dataset. Not the same as prompt engineering, which doesn't change the model's weights.
Foundation Model
A large model trained on broad data at scale, intended to be adapted to many downstream tasks. GPT-4o, Claude 4, and Gemini 2.5 are foundation models. The term was introduced by researchers at Stanford to distinguish these large general-purpose models from task-specific models trained for a single application.
Function Calling (Tool Use)
A feature in model APIs that lets a model indicate it wants to call a specific function with specific arguments, rather than generating prose. You define the available functions (name, description, parameters) and pass them to the model. The model reads them, decides which to call, and returns structured JSON with the call. Your code executes the function and passes the result back to the model. This is how most agent tool use works in practice.
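The round trip described above might look roughly like this on the application side. This is a provider-neutral sketch: get_weather is a hypothetical tool, the field names in TOOLS are illustrative rather than any specific provider's schema, and model_output is a hand-written stand-in for what a model could return.

```python
import json

def get_weather(city: str) -> dict:
    """Hypothetical tool; a real one would call a weather service."""
    return {"city": city, "forecast": "sunny"}

# Tool description you would pass to the model (illustrative field names).
TOOLS = {
    "get_weather": {
        "description": "Look up the forecast for a city.",
        "parameters": {"city": "string"},
        "handler": get_weather,
    }
}

def execute_tool_call(raw_call: str) -> str:
    """Parse the model's structured call, run the function, return the result as JSON."""
    call = json.loads(raw_call)
    tool = TOOLS.get(call["name"])
    if tool is None:
        return json.dumps({"error": f"unknown tool: {call['name']}"})
    result = tool["handler"](**call["arguments"])
    return json.dumps(result)

# Stand-in for the structured JSON a model might return instead of prose.
model_output = '{"name": "get_weather", "arguments": {"city": "Lisbon"}}'
tool_result = execute_tool_call(model_output)
print(tool_result)  # this string would be sent back to the model as the tool result
```

The key design point is that the model never runs anything itself; your code decides whether and how to execute the call it asked for.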
G
Generative AI
AI systems that produce new content (text, images, audio, video, code, or 3D models) rather than just classifying or predicting from existing data. Every model in the image generation, video generation, and language model categories on Agentbrisk is generative AI. The term gets used broadly, sometimes to mean anything AI-related, but the specific meaning is models that generate rather than classify.
GGUF
A file format for quantized large language models designed for efficient local inference. If you're running models locally with Ollama or llama.cpp, you're probably using GGUF files. The format stores the model weights along with metadata needed for inference, making it portable across different hardware configurations.
Guardrails
Constraints applied to model inputs or outputs to prevent harmful, off-topic, or low-quality responses. Guardrails can be implemented as system prompts, additional classifier models that check outputs before they're returned, or structured output schemas. Every production AI application needs some form of guardrails; the debate is about how tight to make them and at which layer to apply them.
H
Hallucination
When a model generates information that sounds plausible but is factually wrong or made up. A model asked about a real person might invent biographical details that don't exist. A coding model might reference a library function that was never implemented. Hallucination is a structural property of language models: they generate likely-sounding text, not text that's been verified against external facts. RAG and grounding techniques reduce hallucination but don't eliminate it.
Hugging Face
A platform for sharing open-source AI models, datasets, and tools. Hugging Face hosts more than a million models, including most major open-weight language and image generation models. If you're looking for a fine-tuned model variant, an embedding model, or a specific checkpoint of a diffusion model, it's probably on Hugging Face.
Hybrid Search
A retrieval strategy that combines dense vector search (semantic similarity via embeddings) with sparse keyword search (BM25 or TF-IDF). Hybrid search catches both conceptually similar passages and exact phrase matches, outperforming either approach alone in most RAG applications. Weaviate and Elasticsearch both support hybrid search natively.
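One common way to merge the two result lists is reciprocal rank fusion (RRF). The entry above does not prescribe a fusion method, so treat this as one option among several; the document IDs are invented and k=60 is the conventional default rather than a tuned value.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of document IDs into one fused ranking."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes more for documents it ranks higher.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

keyword_ranking = ["doc_a", "doc_c", "doc_b"]  # e.g. from BM25
vector_ranking = ["doc_b", "doc_a", "doc_d"]   # e.g. from embedding search
fused = reciprocal_rank_fusion([keyword_ranking, vector_ranking])
print(fused)
```

Documents that appear near the top of both lists rise to the top of the fused list, while documents found by only one retriever are kept but ranked lower.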
I
In-Context Learning
The ability of a large model to learn a task from examples provided in the prompt, without any weight updates. Show the model three examples of the output format you want, and it will follow that format for the next input. This is qualitatively different from fine-tuning: the model adapts to the examples at inference time without any training.
Inference
Running a trained model to generate output. In contrast to training (which updates model weights), inference just uses existing weights to process a new input and produce a result. "Inference cost" refers to the compute cost of generating responses, which matters for deployed applications. Inference speed is measured in tokens per second.
J
JSON Mode
A model output setting that constrains the model to produce valid JSON rather than free-form text. Most production model APIs offer a structured output or JSON mode option. This makes parsing model outputs reliable in code, instead of writing regex to extract values from a prose response. Function calling (see above) is a more powerful version of the same idea.
K
Knowledge Graph
A structured representation of entities and relationships, stored as nodes and edges. A knowledge graph might represent "Company A acquired Company B in 2022" as two entity nodes with a dated relationship edge. Some AI applications combine knowledge graphs with language models to answer structured queries that pure vector search handles poorly, especially questions about specific relationships, chains of facts, or provenance.
L
LoRA (Low-Rank Adaptation)
A parameter-efficient fine-tuning method that adds small trainable matrices to a model without modifying the original weights. LoRA is how most consumer-grade fine-tuned image models work: the "LoRA" files you download from Civitai or Hugging Face are small adapter files layered on top of a base model like Stable Diffusion. Training a LoRA takes much less compute than full fine-tuning, making it practical for individuals with a single GPU.
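A quick back-of-the-envelope calculation shows why the adapter files are so small. LoRA represents the update to a weight matrix as the product of two low-rank matrices; the 4096 x 4096 layer size and rank 8 below are illustrative assumptions, not figures from any particular model.

```python
def lora_param_counts(d_out, d_in, rank):
    """Compare trainable parameters: full weight matrix vs. LoRA adapter pair."""
    full = d_out * d_in              # every entry of W is trainable
    adapter = rank * (d_out + d_in)  # B is d_out x rank, A is rank x d_in
    return full, adapter

full, adapter = lora_param_counts(d_out=4096, d_in=4096, rank=8)
print(full, adapter, f"{adapter / full:.2%}")
```

For this hypothetical layer the adapter trains well under one percent of the parameters that full fine-tuning would touch.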
LLM (Large Language Model)
A neural network trained on large amounts of text to predict and generate natural language. The key word is "large": these models have billions of parameters and are trained on internet-scale data, which is what gives them general language capabilities. GPT-4o, Claude 4, Llama 3, Gemini 2.5, and Mistral are all LLMs. The term is sometimes used to mean any AI model, which is imprecise; LLMs specifically deal in language tokens.
Latency
The delay between sending a request to a model and receiving the first token back. For interactive applications, latency matters more than throughput. Time-to-first-token (TTFT) is the metric you care about for chat experiences. Smaller models and quantized models generally have lower latency than large ones running at full precision.
M
MCP (Model Context Protocol)
An open standard developed by Anthropic for connecting AI models to external tools and data sources. MCP defines a consistent interface so that an agent can connect to a search engine, a file system, a database, or any other tool without custom integration code for each one. See MCP explainer for the full picture.
Mixture of Experts (MoE)
A model architecture where only a subset of the model's parameters (the "experts") is activated for any given input. MoE models can be much larger in total parameter count while keeping inference cost manageable, because most parameters are dormant during any given forward pass. Mixtral and Google's Gemini 1.5 series use MoE architecture.
Multimodal
A model that processes more than one type of input, typically text plus images, and increasingly audio and video. GPT-4o, Claude Sonnet 4, and Gemini 2.5 are multimodal: you can send them an image along with a text question and they'll respond to both. Multimodal capability matters for tasks like analyzing screenshots, describing product images, or understanding charts and diagrams.
N
NLP (Natural Language Processing)
The broader field of computer science dealing with text and language. Machine translation, sentiment analysis, named entity recognition, and text classification are classic NLP tasks. LLMs have absorbed much of what was previously handled by task-specific NLP models, but the terminology still appears frequently in academic papers and product documentation.
Negative Prompt
In image generation, a text prompt specifying what you don't want in the output. Negative prompts in Stable Diffusion-based models let you exclude common artifacts ("blurry, low quality, watermark") or unwanted compositional elements. Midjourney uses the --no parameter instead of a separate negative prompt field. Not all image generators support negative prompts; DALL-E relies on natural language instructions ("do not include text in the image") instead.
O
Ollama
A tool for running open-weight language models locally on your own hardware. Ollama handles model downloads, CUDA setup, and serving a local API compatible with common client libraries. If you want to run Llama 3, Mistral, or other open models without sending data to a cloud provider, Ollama is the standard starting point.
Open-Source (Open-Weight)
A distinction that matters in AI: "open-source" models publish their training code and data; "open-weight" models publish the trained model weights but not necessarily the training details. Llama 3, Mistral, and Gemma are open-weight models: you can download and run them, but Meta, Mistral AI, and Google haven't published full training pipelines. Truly open models (Pythia, OLMo) release training code, data, and weights. Most "open-source AI" discussion is actually about open-weight models.
P
Parameter
A single trainable number in a neural network. When people say a model has "70 billion parameters," they mean it has 70 billion such numbers that were tuned during training. More parameters generally correlate with more capability, but also with higher memory requirements and slower inference. A quantized 7B parameter model can run on a laptop; a 70B parameter model needs a high-end workstation or cloud GPU.
Perplexity
A statistical measure of how well a language model predicts a test text. Lower perplexity means the model assigns higher probability to that text, that is, it predicts it better. Used internally by model researchers but rarely relevant to end users. Not to be confused with Perplexity, the AI search product.
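Concretely, perplexity is the exponential of the average negative log-probability the model assigned to each token of the test text. The sketch below uses made-up per-token probabilities rather than output from a real model.

```python
import math

def perplexity(token_probs):
    """Perplexity = exp of the average negative log-probability per token."""
    avg_neg_log_prob = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_neg_log_prob)

# Hypothetical probabilities a model assigned to the actual next tokens.
confident = perplexity([0.5, 0.5, 0.5, 0.5])
unsure = perplexity([0.1, 0.1, 0.1, 0.1])
print(confident, unsure)
```

A useful intuition: a model that gives every correct token probability 1/N has perplexity N, as if it were choosing uniformly among N options at each step.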
Prompt Engineering
The practice of writing and structuring inputs to get better outputs from a language model. Good prompt engineering isn't just rewording; it involves understanding how the model was trained, what context helps it reason accurately, and how to decompose complex tasks into steps the model can handle reliably. See also chain-of-thought, few-shot prompting, and system prompts.
Q
Quantization
Reducing the numerical precision of model weights to decrease memory footprint and increase inference speed. A model stored in 16-bit floating point can be quantized to 8-bit integers (Q8), 4-bit integers (Q4), or lower. A Q4 model uses roughly half the VRAM of a Q8 model and runs faster, with some quality loss. Quantization is what makes large models runnable on consumer hardware. The GGUF format stores quantized model weights.
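The core idea can be shown with a simplified symmetric 8-bit scheme: pick one scale factor, round each weight to the nearest integer step, and multiply back at inference time. This is only a sketch under that assumption; real formats quantize weights in small blocks with more elaborate schemes, and the sample weights are invented.

```python
def quantize_int8(weights):
    """Symmetric 8-bit quantization: map floats to integers in [-127, 127]."""
    # Fall back to a scale of 1.0 if every weight is zero.
    scale = (max(abs(w) for w in weights) / 127) or 1.0
    quantized = [round(w / scale) for w in weights]
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate float weights from the stored integers."""
    return [q * scale for q in quantized]

weights = [0.12, -0.5, 0.33, 0.02]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, max_error)
```

The rounding error per weight is at most half a quantization step, which is the "some quality loss" that lower bit widths make larger.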
R
RAG (Retrieval-Augmented Generation)
A pattern for giving a language model access to a large document corpus without loading everything into context at once. At query time, you retrieve the most relevant document chunks from a vector database and include them in the prompt. The model answers based on retrieved content rather than relying solely on its training data. RAG reduces hallucination on domain-specific questions and lets you update the knowledge base without retraining. Full implementation guide: how to build a RAG agent.
ReAct (Reason + Act)
A prompting and agent design pattern where the model alternates between reasoning ("I need to find the current price") and acting ("call the product lookup tool with SKU 12345"). ReAct loops are the engine behind most agent implementations. The model writes out a thought, decides on an action, executes it via tool call, reads the result, writes another thought, and so on.
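The thought, action, observation cycle can be sketched as a plain loop. Everything model-related here is faked: scripted_model is a hard-coded stand-in for an LLM call, lookup_price is a hypothetical tool with canned data, and the transcript format is just one possible convention.

```python
def lookup_price(sku: str) -> str:
    """Hypothetical tool with canned data."""
    return {"12345": "$19.99"}.get(sku, "unknown")

TOOLS = {"lookup_price": lookup_price}

def scripted_model(transcript):
    """Stand-in for an LLM call: returns (thought, action, argument)."""
    if not any(line.startswith("Observation:") for line in transcript):
        return ("I need to find the current price.", "lookup_price", "12345")
    observation = transcript[-1].split(": ", 1)[1]
    return ("I have the price.", "finish", f"The price is {observation}.")

def react_loop(question, model, max_steps=5):
    """Alternate reasoning and tool calls until the model finishes or steps run out."""
    transcript = [f"Question: {question}"]
    for _ in range(max_steps):
        thought, action, argument = model(transcript)
        transcript.append(f"Thought: {thought}")
        if action == "finish":
            return argument, transcript
        transcript.append(f"Action: {action}({argument})")
        transcript.append(f"Observation: {TOOLS[action](argument)}")
    return None, transcript

answer, transcript = react_loop("How much is SKU 12345?", scripted_model)
print(answer)
```

The max_steps cap is worth keeping in any real implementation, since a model that never chooses to finish would otherwise loop indefinitely.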
RLHF (Reinforcement Learning from Human Feedback)
A training technique where human raters score model outputs, and those scores are used to train a reward model, which in turn is used to fine-tune the language model via reinforcement learning. RLHF is how most instruction-following and assistant-style models are trained after initial pretraining. It's why Claude, ChatGPT, and Gemini follow instructions in a natural way rather than just completing text.
S
Sampling Parameters
Settings that control the randomness of model output generation. Temperature is the most common: lower values (0.0-0.3) make outputs more deterministic and focused; higher values (0.7-1.0) make outputs more varied and creative. Top-P (nucleus sampling) controls which portion of the probability distribution is available for each token selection. Many applications start from a temperature around 0.7 and adjust from there.
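Here is a small sketch of how temperature and Top-P act on a model's raw scores (logits) before a token is drawn. The logits are invented, and the code assumes temperature is greater than zero; APIs that accept 0 typically treat it as "always pick the most likely token".

```python
import math
import random

def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, top_p):
    """Keep the smallest set of most-likely tokens whose cumulative probability reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}  # renormalized over the kept tokens

logits = [2.0, 1.0, 0.1]  # hypothetical scores for three candidate tokens
cold = softmax_with_temperature(logits, 0.2)
warm = softmax_with_temperature(logits, 1.0)
nucleus = top_p_filter(warm, top_p=0.8)
rng = random.Random(0)
token = rng.choices(list(nucleus), weights=list(nucleus.values()))[0]
print(cold, warm, nucleus, token)
```

At low temperature nearly all probability mass lands on the top token, while Top-P simply removes the unlikely tail before sampling.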
Seed
A number used to initialize a random process. Setting the same seed in an image generator like Midjourney or Stable Diffusion with the same prompt and settings produces the same image. Seeds are useful for iterating on an image while changing specific parameters, because they give you a consistent starting point to compare from.
Stable Diffusion
An open-weight image diffusion model developed by Stability AI. Stable Diffusion was a turning point in generative AI because it released model weights publicly, enabling a large ecosystem of fine-tuned variants, LoRA adapters, and local deployment tools. Most consumer image generation tools are either based on Stable Diffusion or compete directly with it.
System Prompt
A prompt sent to a model that sets its behavior and context before the user's message. System prompts typically define the model's role, constraints, output format, and relevant context. In a customer service chatbot, the system prompt might specify the product the model knows about, the tone to use, and what topics are off-limits. System prompts are invisible to end users but shape every response.
T
Temperature
See Sampling Parameters above.
Tokens
The units a language model processes text in. Tokens aren't exactly words: punctuation, common word fragments, and spaces can all be separate tokens. As a rough rule of thumb, 1 token is about 0.75 words in English. A 100,000 token context window holds roughly 75,000 words of text, or about 250 pages. Tokenization varies by model; code, non-English text, and rare words typically use more tokens per character than common English prose.
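The rule of thumb turns into simple arithmetic. The 300 words per page figure is an assumption used to reproduce the page estimate above; actual token counts depend on the model's tokenizer and the text.

```python
import math

WORDS_PER_TOKEN = 0.75  # rule of thumb for English prose
WORDS_PER_PAGE = 300    # assumed page density

def estimate_words(tokens):
    """Approximate English word count for a given number of tokens."""
    return tokens * WORDS_PER_TOKEN

def estimate_tokens(words):
    """Approximate token count for a given number of English words."""
    return math.ceil(words / WORDS_PER_TOKEN)

words = estimate_words(100_000)
pages = words / WORDS_PER_PAGE
print(words, pages, estimate_tokens(1_500))
```

For billing or context-limit checks, use the provider's own tokenizer rather than this estimate.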
Transformer
The neural network architecture underlying virtually all modern language and vision models. Introduced in the 2017 paper "Attention Is All You Need," the transformer architecture uses attention mechanisms to process sequences in parallel. BERT, GPT, T5, and the models behind every major AI product are transformer variants.
U
Upscaling
Increasing the resolution of an image while adding realistic detail rather than just stretching pixels. AI upscalers like Topaz Labs and Magnific use diffusion-based models to infer and add texture at higher resolutions. The result is a 4K image from a 512x512 source that looks plausibly real rather than blurry. Common in post-processing workflows for Midjourney, DALL-E, and Stable Diffusion outputs.
V
Vector Database
A database optimized for storing and searching high-dimensional embedding vectors. Standard databases search by exact key or range; a vector database returns the nearest neighbors to a query vector based on cosine similarity or Euclidean distance. Common options include Pinecone (managed), Weaviate (open-source), Qdrant (open-source), and pgvector (PostgreSQL extension). Central infrastructure for any RAG implementation.
VRAM (Video RAM)
The memory on a graphics card. Running large AI models locally requires fitting the model's weights in VRAM. A 7B parameter model at 4-bit quantization needs about 4-5 GB of VRAM; a 70B model at the same precision needs around 40 GB. Consumer GPUs top out at 24 GB (RTX 4090) to 32 GB (RTX 5090). This is the most common hardware bottleneck for local model inference.
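Those figures follow from the size of the weights plus some headroom. The sketch below computes the weights-only part; it deliberately ignores the KV cache, activations, and runtime overhead, which account for the gap between the raw numbers and the estimates quoted above.

```python
def weight_memory_gb(params_billions, bits_per_weight):
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    bytes_total = params_billions * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9

for size in (7, 70):
    # Compare 4-bit quantized weights with 16-bit weights for each model size.
    print(size, weight_memory_gb(size, 4), weight_memory_gb(size, 16))
```

At 4 bits the weights come to 3.5 GB for a 7B model and 35 GB for a 70B model, so the 70B model does not fit on a single consumer card even when quantized.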
W
Weights
The numerical parameters of a trained neural network that encode what the model learned during training. When you download a model, you're downloading its weights. Weights are the result of training and are fixed during inference: they don't update when you have a conversation with a model. Fine-tuning updates a model's weights on new data.
Whisper
OpenAI's open-weight speech recognition model. Whisper transcribes audio to text across multiple languages and dialects with high accuracy. It's used as the transcription backbone in many AI video and audio tools, including Descript, Opus Clip, and Captions AI.
X
XAI (Explainable AI)
Methods and techniques for understanding why a model made a specific prediction or generated a specific output. For classification models, XAI tools can highlight which input features most influenced the output. For language models, interpretability research tries to understand which attention heads and circuits are responsible for specific behaviors. XAI matters for regulated industries (finance, healthcare, legal) where model decisions need to be auditable.
Y
YAML
A human-readable data serialization format used for configuration files, structured prompts, and agent workflow definitions. YAML shows up frequently in AI tooling, where frameworks and deployment tools often read pipeline configuration from YAML files. If you're defining agent tools, workflow steps, or evaluation schemas, you'll encounter YAML. It expresses the same kind of nested data as JSON, using indentation instead of curly braces.
Z
Zero-Shot
Asking a model to perform a task without providing any examples. A zero-shot prompt gives the model only a task description and trusts it to figure out the output format and approach. Zero-shot works well for common tasks that appeared frequently in training data; for specialized or unusual tasks, providing a few examples (few-shot) usually produces better results. The distinction matters for prompt engineering: if zero-shot outputs are inconsistent, adding two or three examples often fixes the problem.
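The difference between the two prompt styles is easy to see in code. build_prompt is a hypothetical helper and the "Input:/Output:" layout is just one common convention, not a required format.

```python
def build_prompt(task, examples, new_input):
    """Zero-shot when `examples` is empty, few-shot otherwise."""
    lines = [task, ""]
    for text, label in examples:
        lines += [f"Input: {text}", f"Output: {label}", ""]
    lines += [f"Input: {new_input}", "Output:"]
    return "\n".join(lines)

task = "Classify the sentiment as positive or negative."
zero_shot = build_prompt(task, [], "Great battery life")
few_shot = build_prompt(
    task,
    [("Loved it", "positive"), ("Broke in a week", "negative")],
    "Great battery life",
)
print(few_shot)
```

The few-shot version shows the model both the expected labels and the exact output format, which is usually what fixes inconsistent zero-shot results.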
Zero-Shot Transfer
A model's ability to apply learned capabilities to tasks it wasn't explicitly trained on. For example, a vision model can sometimes correctly label images from a category it never saw during training. Zero-shot transfer is part of why large foundation models are useful across so many tasks: they generalize from training data to novel situations more effectively than smaller, task-specific models.
This glossary is updated as the field evolves. If a term comes up repeatedly in AI tool documentation and isn't here yet, it's a candidate for the next revision. For practical application of these concepts, the RAG guide and the AI agent explainer go deeper on the architecture behind many of these terms.