Show two versions to different users and measure which one performs better. The classic way to know if a change actually helped instead of guessing. Not strictly an AI term, but critical for evaluating features.
AI Glossary
Every AI term I've encountered in the wild.
Showing 163 of 163 terms
Turn off one piece of the system and see how much worse it gets. Tells you which parts are doing real work and which are along for the ride. The AI version of "if I take out this ingredient, does the dish still taste good?"
The percentage of predictions the model got right. Misleading on its own when one answer is much more common than the others — a model that always says "no fraud" is 99% accurate and useless.
The tiny decision inside each neuron about whether to fire and how strongly. Without it, a neural network is just a fancy line — with it, the network can learn curves and complex patterns.
An AI that takes action, not just answers. It reads your instructions, uses tools, checks results, and keeps going until the job is done.
The cycle an agent runs through: think, pick a tool, use it, read the result, decide what to do next. Repeats until the task is done or it gives up.
Adjective for AI that acts on its own initiative — chooses steps, calls tools, recovers from mistakes — instead of just responding to a single prompt. "Agentic workflow" means the AI runs the loop, not you.
A task you hand to an agent end-to-end instead of doing step by step yourself. You write the goal; the agent picks the steps, runs the tools, and reports back.
Getting AI systems to actually do what humans want, including the stuff humans forgot to say out loud. The hard part isn't the obvious rules — it's the edge cases the model encounters that you never thought to specify.
The field of making sure AI systems don't cause harm — accidentally, on purpose, or at scale. Covers everything from preventing jailbreaks today to long-term concerns about superhuman systems.
Low-quality AI-generated content flooding the internet — generic blog posts, fake product reviews, hollow LinkedIn takes. The textual equivalent of spam, but harder to filter because it sounds plausible.
A recipe for solving a problem — a fixed set of steps a computer follows. In ML the algorithm is how the model learns; the model itself is what the algorithm produces.
AI safety company that makes Claude. Founded in 2021 by ex-OpenAI researchers, known for the constitutional AI approach and a heavy emphasis on alignment research.
A defined way for one program to call another. When you use "the OpenAI API," you're sending text over the internet to their servers and getting a model's response back — same as a website fetches data, but for AI.
A secret string that proves you're allowed to use an API and tracks your usage for billing. Treat it like a password — if it leaks, anyone can spend your money.
Software that does tasks we used to think required a human brain — recognizing images, holding a conversation, planning a trip. The term is fuzzy on purpose; today it usually means systems built with machine learning.
The mechanism that lets a model look back at every other word in the input and decide which ones matter for predicting the next word. It's the core trick that makes transformers work.
A model that generates output one piece at a time, where each new piece depends on everything written before it. Every modern chatbot works this way — they don't plan the whole answer first, they write it word by word.
The algorithm that figures out which weights to nudge, and how much, after the model makes a wrong prediction. Without it, training neural networks at scale wouldn't be possible.
Sending many requests to a model at once instead of one at a time. Cheaper per request and faster overall — used when you don't need an instant answer, like overnight processing.
How many training examples the model looks at before updating its weights. Bigger batches train faster but need more memory; smaller batches are noisier but sometimes generalize better.
A standardized test that scores models so you can compare them. Useful for rough rankings; misleading when companies train specifically to beat the benchmark instead of doing real work better.
Google's 2018 language model that read text in both directions at once. It's not used for chat — it's the workhorse behind search ranking, classification, and the original generation of embedding models.
When a model systematically treats some groups worse than others — usually because the training data reflected existing inequities. "The model isn't biased, the world is" is technically true and operationally useless.
A score from 0 to 1 measuring how closely a model's translation matches a reference human translation. Old, crude, and still everywhere in machine translation papers because everyone agreed to use it.
When you fine-tune a model on new data and it loses what it used to know. Like teaching a chef French cuisine until they forget how to make a sandwich.
Asking a model to show its reasoning step by step instead of jumping to the answer. Often produces better results on hard problems — the model works through it the way a person would on a whiteboard.
The standard API format for chatbots: you send a list of messages with roles (system, user, assistant), the model returns the next assistant message. The shape of nearly every chatbot built on top of an LLM API.
OpenAI's consumer chatbot, launched November 2022. The product that made AI mainstream — within two months it had 100 million users and forced every other tech company to ship something similar.
Splitting long documents into smaller pieces so they fit through the embedding model and retrieve cleanly. Done badly, you split a sentence in half and the retrieval breaks; done well, each chunk is a coherent thought.
Predicting which bucket something falls into — spam or not, cat or dog, urgent or routine. One of the two most common things ML does, alongside regression.
Anthropic's family of large language models — competes directly with GPT and Gemini. Known for longer context windows, strong writing, and being trained with constitutional AI methods.
Anthropic's user-facing chatbot, plus the developer products built around the same models — the web app, desktop client, and Claude Code (the agentic coding tool). Same underlying models, different surfaces.
Renting computers from someone else's data center instead of owning your own. Nearly every AI model you use runs in the cloud — your prompt travels to AWS, Azure, or Google, and the answer comes back.
Grouping items by similarity without being told which groups exist — the model finds the patterns. Used for customer segments, anomaly detection, and organizing messy data.
What the model writes back when you give it a prompt. Old API style was "give the model some text, get the continuation"; modern chat APIs returned to the term for the assistant's response.
An agent that controls a computer the way a person does — moves the mouse, types, reads what's on screen. Slower and more error-prone than calling APIs, but works on software that has no API.
A less alarming word for what models do when they make stuff up — they're not lying, they're generating plausible-sounding continuations and don't have a separate "is this true?" check. Some researchers prefer it over "hallucination" because it's closer to what's actually happening.
A small table showing how many predictions were right, wrong, false alarms, or missed. The fastest way to see what kind of mistakes a classifier is actually making.
Anthropic's approach to training models to be helpful and harmless by having the model critique and revise its own outputs against a written set of principles. Replaces a lot of human-labeled examples with AI-generated ones.
The maximum amount of text — prompt plus response — the model can hold in working memory at once. Measured in tokens; modern models range from a few thousand to over a million.
Microsoft's family of AI assistants — GitHub Copilot for code, Microsoft 365 Copilot for Office, plus a growing list of others. Most are built on OpenAI models under the hood.
A way to measure how alike two vectors are by looking at the angle between them, not their length. The default math behind "find me the most semantically similar document."
Making your training data bigger by transforming examples you already have — flipping images, rephrasing sentences, adding noise. Helps the model generalize without you having to collect more data.
Whether your data — prompts, documents, personal info — gets stored, looked at by humans, or used to train future models. The answer varies by vendor and tier; read the policy, especially for enterprise tools.
A collection of examples used to train or test a model. The quality of the dataset determines almost everything about how the final model behaves — garbage in, garbage out, at planetary scale.
The half of a transformer that generates output one token at a time. Modern LLMs like GPT and Claude are "decoder-only" — they skip the encoder and just generate.
The process of turning the model's raw probability outputs into actual words you can read. Different decoding strategies — greedy, sampling, beam search — produce different styles of text from the same model.
Machine learning using neural networks with many stacked layers. "Deep" just means more than a couple of layers — the depth is what lets the model learn complicated patterns instead of simple ones.
A synthetic image, video, or audio clip that convincingly impersonates a real person. The technology is now good enough and cheap enough that detection is permanently behind generation.
The architecture behind most modern image generators (DALL-E, Stable Diffusion, Midjourney). It starts from random noise and gradually denoises it into an image, guided by your prompt.
Training a smaller model to imitate a bigger one. You lose a little quality, gain a lot of speed and cost — the trick behind most "mini" and "flash" model variants.
Running models on the device — your phone, laptop, factory sensor — instead of in a remote data center. Slower per model, but faster end-to-end because there's no network round trip, and your data stays local.
A list of numbers that captures the meaning of a piece of text (or image, or anything else) — so things with similar meaning end up with similar numbers. The math layer that makes semantic search work.
The half of a transformer that reads input text and turns it into internal representations. BERT-style models are encoder-only; useful for understanding text but not generating it.
A specific URL where an API listens — like a phone extension for a particular function. `api.openai.com/v1/chat/completions` is the endpoint that runs chat completions.
One full pass through the entire training dataset. Training usually takes many epochs — each pass nudges the weights a little closer to good predictions.
The European Union's 2024 law regulating AI systems by risk level — minimal, limited, high, or banned. The first major AI-specific regulation; sets the bar most multinational vendors will quietly conform to globally.
Tests you run to measure whether your AI system is actually doing its job. The discipline matters more than the tool — without evals you're shipping on vibes.
A single number that combines precision and recall — high when both are high, low if either is low. The go-to metric when you care equally about false positives and false negatives.
One column of input the model uses to make predictions — age, price, word count, anything quantifiable. Choosing the right features ("feature engineering") used to be most of the job before deep learning learned to do it itself.
Giving the model a few examples of the task inside the prompt before asking it to do a new one. Much more reliable than just describing the task, especially when the format matters.
Training an AI model on your own data so it gets permanently better at a specific task. Expensive, slow, and rarely needed — try better prompting and RAG first.
A big, general-purpose model trained on huge amounts of data that you then build specific applications on top of. GPT, Claude, and Gemini are foundation models; the chatbots and tools you use are built on them.
Letting the model decide when to call a piece of code — your function — and what arguments to pass. The mechanism underneath every "AI that does things" feature; same idea as tool use, often the same API.
Google's family of multimodal AI models and the chatbot built on them. Strong on long context and tight integration with Google's products (Gmail, Docs, Workspace).
AI that produces new content — text, images, audio, code — instead of just analyzing existing content. The umbrella term for everything ChatGPT, Midjourney, and their cousins do.
A small, carefully labeled set of examples that you trust to be correct — your ground truth for evaluation. You run every model change against the golden set to see if it got better or worse.
The family of language models behind ChatGPT — generative, pre-trained, transformer-based. The name has become so identified with OpenAI that "GPT" colloquially means their models specifically, not the architecture.
OpenAI's lineup of branded models — GPT-3.5, GPT-4, GPT-4o, and successors — sold via ChatGPT and the API. Each new number is a meaningful capability jump; the suffixes (o, mini, turbo) usually mean cheaper or faster variants.
A chip originally built for video game graphics that turns out to be perfect for the math behind neural networks. NVIDIA dominates this market, which is why their stock chart looks the way it does.
The optimization method behind nearly all model training: figure out which direction makes the model less wrong, take a small step that way, repeat millions of times. It's how the weights actually move during training.
Tying the model's answer to a specific source it can point at, instead of relying on what it might have memorized during training. The whole point of RAG and citations — "don't trust me, check this document."
Deterministic checks layered around an agent — schema validation on outputs, scope limits on tools, human approval for risky actions. Defense-in-depth so you don't have to trust the model alone.
When a model produces a confident, fluent answer that's wrong — invented citations, made-up dates, fake quotes. It doesn't know it's wrong; from the inside it feels exactly the same as being right.
The scaffolding around an agent that runs the loop, manages state, persists artifacts, enforces validators, and pauses for human approval. The model is the brain; the harness is everything that keeps the brain on task.
The default place open-source AI models live — a Git-like hub for sharing models, datasets, and demos. If a non-OpenAI/Anthropic/Google model exists, it's almost certainly on Hugging Face.
Having actual humans rate model outputs — for quality, helpfulness, accuracy, whatever you care about. Expensive and slow, still the gold standard when no automated metric captures what matters.
Inserting a human approval step into an automated workflow — "the agent drafted this email, click send to confirm." The right pattern when the cost of a wrong action is higher than the friction of asking.
A setting you pick before training starts — learning rate, batch size, number of layers — that controls how the model trains. Different from parameters, which are what the model learns.
Producing pictures from a text description. The big names — Midjourney, DALL-E, Stable Diffusion, Imagen — are all diffusion models doing the same basic trick with different style preferences.
The model learning a task from examples in your prompt, without any actual training. It's not really "learning" — the weights don't change — but the model's behavior shifts based on what it sees in context.
Pre-computing embeddings for your documents and storing them in a vector database so retrieval is fast at query time. Without an index, every search would re-embed everything from scratch.
Using a trained model to make predictions — what happens every time you send a prompt to ChatGPT. Training is once and expensive; inference is millions of times and adds up.
The piece of infrastructure that loads a model into memory and serves predictions over an API. Examples: vLLM, TGI, Triton — they handle batching, queueing, and squeezing as many requests as possible out of a GPU.
Fine-tuning a base model on examples of "here's an instruction, here's a good response" so it stops auto-completing and starts answering questions. The step that turns a raw language model into something useful for chat.
A prompt that tricks the model into ignoring its safety training — usually by role-play, fake scenarios, or layered indirection. The model didn't actually forget the rules; you just gave it a story where breaking them seems okay.
A structured database where things (people, products, concepts) are nodes and the relationships between them are edges. Old idea from semantic web research, newly relevant for grounding LLM answers in known facts.
A model trained on enormous amounts of text to predict the next word. Scale up that single trick enough and you get something that can write, summarize, code, and hold a conversation.
The delay between asking the model something and getting an answer. The thing that decides whether AI feels magical or annoying — sub-second is invisible, multi-second is a UX problem.
The internal mathematical space where a model represents meaning before turning it into output. "Walking through latent space" is why you can morph one image into another or interpolate between concepts.
A public ranking of models on a benchmark. Useful for a quick read on the field; treat top-rank claims skeptically — leaderboards get gamed, contaminated, and outgrown fast.
How big a step the model takes each time it updates its weights. Too high and it overshoots and never settles; too low and training drags on forever — the single most-tuned hyperparameter.
Meta's family of openly released language models. The most influential open-weight models — most independent fine-tunes, on-device deployments, and "local LLM" projects start from a Llama checkpoint.
A way to fine-tune a model by training a tiny adapter on top instead of updating the original weights. Cheap, fast, and you can swap LoRAs in and out for different tasks without retraining anything.
The math that measures how wrong the model's prediction was on each example. Training is the process of making this number go down — the choice of loss function quietly defines what "better" means.
Software that learns patterns from data instead of being explicitly programmed with rules. The umbrella under which deep learning, LLMs, and most modern AI live.
A standard way to connect AI to your tools. Like a USB port: one plug shape that works with Slack, Gmail, Jira, and everything else.
French AI company that releases strong open-weight models alongside a commercial API. The European answer to OpenAI and Anthropic; their smaller models punch above their weight.
An architecture where the model has many specialist sub-networks ("experts") and a router that picks a few for each token. You get the capability of a huge model at the inference cost of a smaller one.
DevOps for machine learning — the practices and tools for shipping, monitoring, and updating models in production. Includes data versioning, retraining pipelines, drift detection, and a long debate about how it differs from regular DevOps.
The thing you get out of training — a file full of numbers (weights) plus the architecture that uses them. "The model" is what you ship; everything else (data, training code) is what produced it.
A short document the model maker publishes explaining what the model can do, what it can't, what it was trained on, and known risks. Like a nutrition label — useful when it's honest, marketing when it isn't.
Multiple agents working together — one plans, others specialize, a coordinator routes between them. Useful when one agent juggling everything starts dropping balls; expensive when the coordination overhead exceeds the gain.
A model that handles more than one kind of input or output — text plus images, plus audio, plus video. The default for new flagship models; "text-only" is now the exception.
The branch of AI that deals with human language — translation, summarization, sentiment, search. LLMs ate most of the old NLP techniques; the field is now mostly applied LLM work.
A model loosely inspired by the brain — layers of simple units that pass signals to each other through weighted connections. The shape behind almost all modern AI, including every LLM and image generator you've used.
The one and only thing a base LLM is trained to do: given some text, predict the next token. Everything else — answering questions, writing code, having a conversation — falls out of doing that one thing extremely well at scale.
A model whose weights are publicly downloadable — Llama, Mistral, Qwen, DeepSeek. "Open-weight" is more precise: most don't release training data or code, so they aren't open-source in the strict sense.
The company behind ChatGPT and the GPT model family. Started as a nonprofit research lab in 2015, now the most commercially dominant AI company; deeply entangled with Microsoft.
Coordinating multiple models, tools, or agents into a workflow — deciding what runs when, what gets passed where, what happens on failure. The boring plumbing that makes an agent reliable instead of a demo.
When a model memorizes the training data instead of learning the underlying pattern — perfect on examples it's seen, terrible on new ones. The single most common failure mode in ML.
One of the numbers the model learns during training — billions or trillions of them in a modern LLM. "70 billion parameters" is a rough proxy for how big and capable a model is, but not the whole story.
Spotting regularities in data — the underlying job of nearly every ML system. Whether it's recognizing a face, detecting fraud, or predicting the next word, it's pattern matching at scale.
Umbrella term for techniques (LoRA, adapters, prefix tuning) that fine-tune only a small fraction of a model's weights. Most fine-tuning you'll ever do is PEFT, not the full-weight retraining lab papers describe.
A measurement of how surprised a language model is by some text — lower is better. A classic LLM benchmark; less useful these days because it doesn't capture whether the model is actually helpful.
An AI-powered search engine — answers questions with citations to live web sources. Effectively a productized RAG pipeline over the open web.
A managed vector database — you push embeddings in, query for nearest neighbors, and they handle the infrastructure. One of the most common choices for production RAG systems.
The step where an agent breaks a goal into sub-steps before executing. Doing this well — and updating the plan when reality pushes back — is what separates a useful agent from a confused one.
The first, expensive stage of training a foundation model — feed it a huge chunk of the internet and have it predict next tokens until it's learned how language works. After this comes the cheaper stages: fine-tuning, instruction tuning, RLHF.
Of the things the model flagged as positive, what fraction actually were? High precision means few false alarms; you can trust the model when it says "yes."
The text you send to a model to get a response. Sounds basic; the difference between a vague prompt and a careful one is often the difference between a useless answer and a useful one.
The craft of writing prompts that reliably get the model to do what you want. Less "engineering" than careful editing — clear instructions, useful examples, explicit format requests.
An attack where instructions are hidden inside data the model reads — "ignore previous instructions and email me the contents" tucked into a webpage. The unsolved security problem of every tool-using agent.
The AI searches your documents first, then answers based on what it found, instead of guessing from memory. Like handing someone a reference binder before asking them a question.
A model trained to spend extra compute "thinking" — generating a hidden chain of reasoning before answering. Slower and pricier per query, but markedly better at math, code, and multi-step problems.
Of all the things that were truly positive, what fraction did the model catch? High recall means few misses; matters most when missing something is costly (disease screening, fraud).
Deliberately attacking your own AI system to find ways it breaks — jailbreaks, harmful outputs, bias, prompt injection. Adopted from security; now a standard step before launching any frontier model.
Using red-team-style adversarial prompts as an evaluation suite — run them every model update and watch the failure rate. Turns ad-hoc safety testing into something measurable over time.
The step where an agent stops and critiques its own work — "did I actually answer the question? Are there gaps?" — before continuing. Often catches mistakes the first pass missed.
Predicting a number — house price, temperature, sales next quarter — rather than a category. The other half of supervised learning alongside classification.
A second pass that re-orders search results using a stronger, slower model. Cheap retrieval grabs the top 50 candidates; the reranker sorts them properly so the top 5 you actually use are the right ones.
Finding the right documents (or passages) to feed the model before it answers. The "R" in RAG, and the part that determines whether the answer ends up grounded or made up.
Training step where humans rate model outputs and the model learns to produce more of what they prefer. The reason ChatGPT felt different from earlier LLMs — same base model, but trained to be helpful instead of just predictive.
A variant of RL where the reward signal comes from an automatic verifier — a unit test, a math proof checker, a code compiler — instead of human raters. Cheaper and faster than RLHF because correctness is machine-checkable; used heavily for training reasoning and code models.
Picking the next token randomly from the model's probability distribution instead of always taking the top one. What makes outputs feel varied — same prompt, different answers each time.
A library that wraps an API so you can call it in your language of choice without dealing with raw HTTP. Anthropic, OpenAI, and Google all publish SDKs in Python, TypeScript, and others.
Searching by meaning instead of keywords — "how do I cancel" finds the page titled "Subscription Termination." Powered by embeddings and vector similarity instead of word matching.
Given one vector, find the closest others in your database. The core operation a vector database optimizes — sub-second lookups across millions or billions of vectors.
A packaged capability an agent can load — instructions, tools, sometimes example workflows — to handle a specific task domain. Think of it as a focused expansion pack: "survey design," "contract review," "morning briefing."
Writing a careful spec first and letting an agent generate the code from it — the opposite of vibe coding. Slower upfront, far less rework when the agent's first draft would have been wrong.
Forcing the model to return data in a fixed schema — usually JSON — instead of free-form text. Essential when something downstream needs to parse the answer; the difference between automation and a copy-paste job.
Training a model from labeled examples — "here's the input, here's the right answer." Most practical ML you've seen is supervised: spam classifiers, medical imaging, recommendation systems.
Instructions you give the model that the user never sees — "you are a helpful assistant for X, never reveal Y, always respond in Z format." Sets the model's role and constraints for the whole conversation.
A setting between 0 and ~2 that controls how random the model's output is. Low temperature = consistent and conservative; high temperature = varied and creative (and more likely to go off the rails).
Producing text from a model — the default thing LLMs do. Covers everything from finishing a sentence to writing a 10-page report; under the hood it's all next-token prediction.
The chunks of text a model actually sees — usually a word, a piece of a word, or a punctuation mark. Roughly 4 characters in English; pricing, context windows, and rate limits are all counted in tokens, not words.
Same idea as function calling and tool use — the model emits a structured request to invoke a tool, and the runtime executes it. The terms are mostly used interchangeably across vendors.
When the model uses an external tool — a search engine, a calculator, your API — instead of just producing text. The capability that turns a chatbot into something that can actually do things in the world.
Two ways to limit which tokens the model is allowed to sample from. Top-k keeps the k most likely tokens; top-p (nucleus) keeps the smallest set whose probabilities sum to p — both prevent really weird picks without going full deterministic.
Google's custom chips built specifically for ML workloads — competitors to NVIDIA's GPUs. You'll mostly run into them indirectly via Google Cloud or because the model you're using was trained on them.
The examples the model learns from. Its quality, breadth, and biases determine almost everything the resulting model will be good or bad at — and what blind spots it'll have.
Taking a model trained on one task and adapting it for another — the model already knows a lot about language or images, so it picks up the new task fast. Fine-tuning is one form of transfer learning.
The neural network architecture introduced in a 2017 Google paper that runs essentially every modern AI system — GPT, Claude, Gemini, image generators, the lot. Its key idea is attention: letting the model look at every part of the input at once.
Training a model on data without labels — it has to find structure on its own. Clustering and most LLM pre-training (predict the next word, no labels needed) fall under this umbrella.
A list of numbers representing something — a word, an image, a chunk of text. In modern AI, the vector is the model's idea of "what this means" in a form math can work with.
A database optimized for storing and searching embeddings — find the nearest vectors to a query vector, fast, at scale. Pinecone, Weaviate, Chroma, pgvector are common ones; the backbone of most RAG systems.
Sometimes used interchangeably with vector database; sometimes used for a lighter-weight library (FAISS, Chroma in local mode) that doesn't run as a separate service. Same job — store vectors, find similar ones — different deployment shape.
Writing software by describing what you want to an AI in plain language and accepting whatever it gives back, without reading the code too carefully. Fast and fun for prototypes; a maintenance nightmare for anything serious.
A company that builds high-quality embedding and reranking models for RAG, now part of Anthropic. Frequently the choice when OpenAI's `text-embedding-3` isn't quite good enough.
Embedding a hidden signal in AI-generated content so it can be detected later as machine-made. Works in theory; in practice the signals are fragile — a paraphrase or a screenshot often strips them.
The numbers inside a neural network that get adjusted during training and stay frozen during use. "Open-weights model" means the company released these numbers; everything the model knows is encoded in them.
Asking the model to do a task without giving it any examples — just the instructions. Works surprisingly well for things the model has seen variants of during training; less reliable when the output format matters.
No terms match your search. Try different keywords.