AI Glossary

Short explanations of AI benchmarks, models, agents and infrastructure terms that appear across DeepSignal.

SWE-Bench

SWE-Bench is a software-engineering benchmark that tests whether AI systems can fix real GitHub issues inside existing repositories. It matters because coding agents are now judged less by toy coding prompts and more by whether they can understand bugs, edit multi-file codebases, and produce patches that pass the repository's tests.

Humanity's Last Exam

Humanity's Last Exam (HLE) is a difficult expert-level benchmark for testing frontier AI systems across broad academic and professional knowledge. It matters because many standard benchmarks are saturated, so labs use harder exams like HLE to show whether models can answer questions that still challenge specialists.

GPQA

GPQA, short for Graduate-Level Google-Proof Q&A, is a science question-answering benchmark designed to test difficult expert reasoning. It matters because strong GPQA scores suggest a model can handle specialized physics, chemistry, and biology questions that are hard to solve by search or simple pattern matching.

MMLU

MMLU, or Massive Multitask Language Understanding, is a broad benchmark that evaluates model knowledge across many academic and professional subjects. It matters because it became a standard reference point for LLM releases, even though newer models increasingly need harder benchmarks to show meaningful gains.

ARC-AGI

ARC-AGI is an abstraction and reasoning benchmark where AI systems solve novel visual tasks from only a few examples. It matters because it targets generalization: systems must infer hidden rules instead of relying on memorized internet text or familiar benchmark patterns.

LiveCodeBench

LiveCodeBench is a coding benchmark built from recent programming tasks to evaluate code generation and problem solving. It matters because using newer tasks helps reduce benchmark contamination, making it harder for models to succeed by memorizing older public examples.
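
One way to act on the contamination idea above is to score a model only on tasks published after its training cutoff. The sketch below is illustrative, not LiveCodeBench's own tooling: the task records, the `released` field, and the cutoff date are all hypothetical.

```python
from datetime import date

# Hypothetical task records; the ids and dates are made up for illustration.
TASKS = [
    {"id": "task-a", "released": date(2023, 1, 15)},
    {"id": "task-b", "released": date(2024, 3, 2)},
    {"id": "task-c", "released": date(2024, 9, 20)},
]

def tasks_after_cutoff(tasks, cutoff):
    """Keep only tasks released strictly after the model's training cutoff."""
    return [task for task in tasks if task["released"] > cutoff]

# Assumed training cutoff for some model under evaluation.
fresh_tasks = tasks_after_cutoff(TASKS, cutoff=date(2024, 1, 1))
fresh_ids = [task["id"] for task in fresh_tasks]
```

A later cutoff shrinks the evaluation set, so the filter trades sample size for a lower chance that the model saw the task during training.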

MCP

MCP, or Model Context Protocol, is a protocol for connecting AI assistants to external tools, data sources, and services through a standard interface. It matters because agent and coding workflows increasingly need reliable context access without every app inventing a custom integration layer.

RAG

RAG, or Retrieval-Augmented Generation, is a pattern where an AI system retrieves relevant documents before generating an answer. It matters because retrieval can ground responses in current or private information, reducing hallucination risk when the model alone lacks the needed context.
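
The retrieve-then-generate pattern can be sketched in a few lines. This is a minimal illustration under stated assumptions: the corpus is a hypothetical in-memory dictionary, ranking uses plain word overlap rather than embeddings, and the final call to a language model is left out.

```python
import re
from collections import Counter

# Hypothetical corpus; a real system would query a search index or vector store.
DOCS = {
    "refunds": "Refunds are issued within 14 days of a return request.",
    "shipping": "Standard shipping takes 3 to 5 business days.",
    "warranty": "Hardware carries a two year limited warranty.",
}

def tokens(text):
    """Lowercase word counts for a piece of text."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(query, docs, k=1):
    """Return the k (doc_id, text) pairs sharing the most words with the query."""
    query_tokens = tokens(query)

    def overlap(item):
        return sum((query_tokens & tokens(item[1])).values())

    return sorted(docs.items(), key=overlap, reverse=True)[:k]

def build_prompt(query, docs, k=1):
    """Place the retrieved text ahead of the question for the model to use."""
    context = "\n".join(f"[{doc_id}] {text}" for doc_id, text in retrieve(query, docs, k))
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"

prompt = build_prompt("How long does shipping take?", DOCS)
```

The prompt, not the model weights, carries the current or private facts, which is why retrieval quality largely determines answer quality.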

Context Engineering

Context engineering is the practice of designing what information, tools, instructions, memory, and retrieved evidence an AI system receives before it acts. It matters because stronger models still fail when the surrounding context is stale, noisy, incomplete, or poorly structured.

Function Calling

Function calling is a model capability where an AI system returns structured arguments for a developer-defined function or tool, which the application then validates and executes. It matters because it lets language models trigger actions, query APIs, and produce machine-readable outputs instead of only generating prose.
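
A minimal sketch of the application side of function calling follows. It assumes the model emits a JSON object with `name` and `arguments` keys; that shape, the `get_weather` tool, and the `dispatch` helper are hypothetical and do not reproduce any particular vendor's format.

```python
import json

def get_weather(city: str, unit: str = "celsius") -> str:
    """Hypothetical developer-defined tool; returns placeholder text."""
    return f"Weather for {city} in {unit}: (stub)"

# Registry of functions the model is allowed to request.
TOOLS = {"get_weather": get_weather}

def dispatch(model_output: str) -> str:
    """Parse the model's structured output and run the named function."""
    call = json.loads(model_output)
    name = call["name"]
    arguments = call.get("arguments", {})
    if name not in TOOLS:
        raise ValueError(f"unknown tool: {name}")
    if not isinstance(arguments, dict):
        raise ValueError("arguments must be a JSON object")
    return TOOLS[name](**arguments)

# Stand-in for text a model might return; no model is called here.
model_output = '{"name": "get_weather", "arguments": {"city": "Oslo"}}'
result = dispatch(model_output)
```

The model only proposes the call. The registry lookup and argument check are where the application decides whether to run it.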

Tool Use

Tool use is the ability of an AI system to call external tools such as search, code execution, databases, calculators, or business APIs. It matters because many real tasks require current data or side effects that a language model cannot provide from weights alone.

Agent Memory

Agent memory is the information an AI agent stores or retrieves across steps, sessions, users, or tasks. It matters because persistent memory can improve continuity and personalization, but it also introduces accuracy, privacy, and governance risks if the stored context is wrong or overused.

Agent Evaluation

Agent evaluation measures whether AI agents can plan, call tools, recover from errors, and complete multi-step tasks. It matters because one-shot model benchmarks do not fully capture real agent behavior, where reliability depends on orchestration, memory, tools, and execution traces.

Multimodal AI

Multimodal AI refers to models that can process or generate multiple data types such as text, images, audio, video, and sensor inputs. It matters because many real-world tasks depend on combining language with visual or auditory evidence rather than treating text as the only interface.

Open-Weight AI

Open-weight AI refers to models whose trained weights are released for others to download, inspect, fine-tune, or deploy. It matters because open weights can reduce vendor lock-in and enable private deployment, while still leaving open questions about licensing, safety, and true openness.

Physical AI

Physical AI refers to AI systems designed to perceive, reason about, and act in the physical world through robots, vehicles, sensors, or simulations. It matters because model progress is moving beyond chat and code into autonomy, manufacturing, logistics, and safety-critical environments where real-world reliability is the product.

Multi-Agent Systems

Multi-agent systems use multiple AI agents that coordinate, debate, delegate, or specialize across a task. They matter because many real workflows are too broad for a single model call: teams are testing planner, researcher, coder, reviewer, and tool-using agents that work together with shared state and guardrails.

Vision-Language Models

Vision-language models are multimodal AI systems that jointly process images or video with text. They matter because assistants, robotics, document automation, medical imaging, and UI agents increasingly need visual evidence plus language reasoning instead of text-only context.

Direct Preference Optimization

Direct Preference Optimization (DPO) is a training method that tunes language models directly on preference data, without training a separate reward model or running a reinforcement learning loop. It matters because many labs and open-model teams use DPO-style methods to align responses, improve instruction following, and make models cheaper to refine after supervised training.
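
For a single preference pair, the DPO objective reduces to a logistic loss on how much more the tuned policy favors the chosen response over the rejected one, relative to a frozen reference model. The sketch below computes that per-pair loss from summed log-probabilities; the function name, argument names, and the example numbers are illustrative, and a real trainer would batch this with automatic differentiation.

```python
import math

def dpo_pair_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Per-pair DPO loss: -log(sigmoid(beta * margin)).

    Each argument is the summed log-probability of a full response under
    the policy or the frozen reference model.
    """
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    z = beta * margin
    # Numerically stable softplus(-z), which equals -log(sigmoid(z)).
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))

# Illustrative log-probabilities, not measurements from any model.
loss_untrained = dpo_pair_loss(-12.0, -12.0, -12.0, -12.0)
loss_improved = dpo_pair_loss(-10.0, -14.0, -12.0, -12.0)
```

When the policy matches the reference, the margin is zero and the loss equals log 2. It falls as the policy shifts probability toward the chosen response.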

Terminal-Bench

Terminal-Bench is a benchmark for evaluating whether AI agents can complete tasks in a terminal-like software environment. It matters because coding and operations agents need to run commands, inspect outputs, recover from errors, and finish multi-step work rather than only write code snippets.

On-Device AI

On-device AI runs models directly on phones, PCs, robots, cars, or edge hardware instead of sending every request to a cloud service. It matters because local inference can reduce latency, improve privacy, lower serving costs, and make AI features work in constrained or offline environments.

Large Language Models (LLMs)

A large language model (LLM) is an AI model trained on large text datasets to understand and generate language, code, and structured responses. LLMs matter because they provide the foundation for modern assistants, coding tools, search products, and agents, while their reliability still depends on context, tools, evaluation, and deployment choices.

Test-Time Scaling

Test-time scaling improves an AI system's answer by spending more computation during inference, for example through longer reasoning, multiple candidate solutions, search, or verification. It matters because models can often solve harder tasks without retraining when the system allocates compute adaptively and uses reliable methods to select or check the result.
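
Two of the selection strategies mentioned above, voting over multiple candidates and checking candidates with a verifier, can be sketched as follows. The sampled answers are hypothetical stand-ins for model outputs, and the helper names are illustrative.

```python
from collections import Counter

def majority_vote(answers):
    """Pick the answer that appears most often among the samples."""
    return Counter(answers).most_common(1)[0][0]

def first_verified(answers, verifier):
    """Return the first sampled answer the verifier accepts, or None."""
    for answer in answers:
        if verifier(answer):
            return answer
    return None

# Hypothetical samples for "which positive integer x satisfies x * x == 144?"
samples = [12, 14, 12, 11, 12]

voted = majority_vote(samples)
checked = first_verified(samples, lambda x: x * x == 144)
```

Voting needs no ground truth but can settle on a popular wrong answer. Verification is more reliable when a cheap, trustworthy check exists, as it does for many math and code tasks.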

Long-Horizon Agents

Long-horizon agents are AI systems designed to complete tasks that require many steps, extended execution time, persistent state, and repeated tool use. They matter because real software, research, and operational workflows demand more than a single response: agents must preserve context, detect errors, recover, and remain aligned with the original goal.

GRPO

GRPO, or Group Relative Policy Optimization, is a reinforcement-learning method that trains a model by comparing rewards across a group of sampled responses instead of relying on a separate value model. It matters because it can make reasoning-model post-training more memory-efficient while still encouraging responses that score better on verifiable tasks.
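
The group comparison at the core of GRPO can be illustrated by the advantage computation alone: each sampled response is scored against the mean and spread of its own group, so no learned value model is needed as a baseline. The sketch assumes population standard deviation and a small `eps` for stability. It omits the clipped policy update and KL penalty that a full implementation would include, and the reward values are made up.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and std of its own group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption of this sketch
    return [(reward - mean) / (std + eps) for reward in rewards]

# Hypothetical rewards for four responses sampled from one prompt,
# e.g. 1.0 when a verifiable answer is correct and 0.0 otherwise.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

If every response in a group earns the same reward, all advantages are zero and that prompt contributes no learning signal for the step.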