DeepSignal
© 2026 DeepSignal · About
  • All
  • Featured
  • Latest
  • Guides
  • Daily
  • Weekly
  • Saved
  • Subscribe
  • Sources
  • About
  • Feedback
Sign in
  • Featured
  • Latest
  • Guides
  • Daily
  • Weekly

    AI Glossary

    Short explanations of AI benchmarks, models, agents and infrastructure terms that appear across DeepSignal.

    AI Glossary

    SWE-Bench

    SWE-Bench is a software-engineering benchmark that tests whether AI systems can fix real GitHub issues inside existing repositories. It matters because coding agents are now judged less by toy coding prompts and more by whether they can understand bugs, edit multi-file codebases, run tests, and produce accepted patches.

    AI Glossary

    Humanity's Last Exam

    Humanity's Last Exam is a difficult expert-level benchmark for testing frontier AI systems across broad academic and professional knowledge. It matters because many standard benchmarks are saturated, so labs use harder exams like HLE to show whether models can answer questions that still challenge specialists.

    AI Glossary

    GPQA

    GPQA is a graduate-level science question-answering benchmark designed to test difficult expert reasoning. It matters because strong GPQA scores suggest a model can handle specialized physics, chemistry, and biology questions that are hard to solve by search or simple pattern matching.

    AI Glossary

    MMLU

    MMLU, or Massive Multitask Language Understanding, is a broad benchmark that evaluates model knowledge across many academic and professional subjects. It matters because it became a standard reference point for LLM releases, even though newer models increasingly need harder benchmarks to show meaningful gains.

    AI Glossary

    ARC-AGI

    ARC-AGI is an abstraction and reasoning benchmark where AI systems solve novel visual tasks from only a few examples. It matters because it targets generalization: systems must infer hidden rules instead of relying on memorized internet text or familiar benchmark patterns.

    AI Glossary

    LiveCodeBench

    LiveCodeBench is a coding benchmark built from recent programming tasks to evaluate code generation and problem solving. It matters because using newer tasks helps reduce benchmark contamination, making it harder for models to succeed by memorizing older public examples.

    AI Glossary

    MCP

    MCP, or Model Context Protocol, is a protocol for connecting AI assistants to external tools, data sources, and services through a standard interface. It matters because agent and coding workflows increasingly need reliable context access without every app inventing a custom integration layer.

    AI Glossary

    RAG

    RAG, or Retrieval-Augmented Generation, is a pattern where an AI system retrieves relevant documents before generating an answer. It matters because retrieval can ground responses in current or private information, reducing hallucination risk when the model alone lacks the needed context.

    AI Glossary

    Context Engineering

    Context engineering is the practice of designing what information, tools, instructions, memory, and retrieved evidence an AI system receives before it acts. It matters because stronger models still fail when the surrounding context is stale, noisy, incomplete, or poorly structured.

    AI Glossary

    Function Calling

    Function calling is a model capability where an AI system returns structured arguments for a developer-defined function or tool. It matters because it lets language models take reliable actions, query APIs, and produce machine-readable outputs instead of only generating prose.

    AI Glossary

    Tool Use

    Tool use is the ability of an AI system to call external tools such as search, code execution, databases, calculators, or business APIs. It matters because many real tasks require current data or side effects that a language model cannot provide from weights alone.

    AI Glossary

    Agent Memory

    Agent memory is the information an AI agent stores or retrieves across steps, sessions, users, or tasks. It matters because persistent memory can improve continuity and personalization, but it also introduces accuracy, privacy, and governance risks if the stored context is wrong or overused.

    AI Glossary

    Agent Evaluation

    Agent evaluation measures whether AI agents can plan, call tools, recover from errors, and complete multi-step tasks. It matters because one-shot model benchmarks do not fully capture real agent behavior, where reliability depends on orchestration, memory, tools, and execution traces.

    AI Glossary

    Multimodal AI

    Multimodal AI refers to models that can process or generate multiple data types such as text, images, audio, video, and sensor inputs. It matters because many real-world tasks depend on combining language with visual or auditory evidence rather than treating text as the only interface.

    AI Glossary

    Open-Weight AI

    Open-weight AI refers to models whose trained weights are released for others to download, inspect, fine-tune, or deploy. It matters because open weights can reduce vendor lock-in and enable private deployment, while still leaving open questions about licensing, safety, and true openness.