AI Glossary
What is Terminal-Bench?
Overview
Terminal-Bench is a benchmark for evaluating whether AI agents can complete tasks in a terminal-like software environment. It matters because coding and operations agents need to run commands, inspect outputs, recover from errors, and finish multi-step work rather than only write code snippets.
Why it matters
Terminal-style benchmarks test the execution loop that real software agents depend on: plan, act, observe, and recover.
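The plan, act, observe, and recover loop can be sketched in a few lines of Python. This is an illustrative sketch only, not the Terminal-Bench harness or its API: the names `Observation`, `run_command`, `agent_loop`, and `scripted_policy` are hypothetical, and the scripted policy stands in for a model that would normally choose each command.

```python
import subprocess
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Observation:
    """What the agent sees after one command (hypothetical structure)."""
    command: str
    exit_code: int
    output: str


def run_command(command: str, timeout: float = 10.0) -> Observation:
    """Act: run one shell command and capture its exit code and output."""
    try:
        proc = subprocess.run(
            command, shell=True, capture_output=True, text=True, timeout=timeout
        )
        return Observation(command, proc.returncode, proc.stdout + proc.stderr)
    except subprocess.TimeoutExpired:
        # 124 mirrors the exit code used by the coreutils `timeout` command.
        return Observation(command, 124, "timed out")


def agent_loop(
    policy: Callable[[List[Observation]], Optional[str]],
    is_done: Callable[[List[Observation]], bool],
    max_steps: int = 10,
    run: Callable[[str], Observation] = run_command,
) -> List[Observation]:
    """Repeat plan -> act -> observe until the task is done or steps run out."""
    history: List[Observation] = []
    for _ in range(max_steps):
        command = policy(history)      # plan: choose the next command
        if command is None:            # the policy has nothing left to try
            break
        history.append(run(command))   # act, then observe the result
        if is_done(history):
            break
    return history


def scripted_policy(history: List[Observation]) -> Optional[str]:
    """Stand-in for a model: try one command, recover if it failed."""
    if not history:
        return "cat missing_file.txt"  # assumed not to exist, so this fails
    if history[-1].exit_code != 0:
        return "echo recovered"        # recover: switch to a fallback
    return None
```

Calling `agent_loop(scripted_policy, lambda h: h[-1].exit_code == 0)` runs the failing command, observes the non-zero exit code, and then tries the fallback. The recovery step lives in the policy because it is the policy that reads the previous observation. A real benchmark harness would typically run commands in an isolated environment and judge success by checking the final state rather than the last exit code.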
Where it appears in AI research
- AI coding agent evaluations
- Tool-use benchmark discussions
- Command-line automation research
- Developer agent product comparisons
Related DeepSignal articles
Dissecting model behavior through agent trajectories
AI Summary
The paper identifies an 'intent-execution' gap in AI agents and stresses its significance alongside harness design. Its 'Simple Strands Agent' (SSA) shows improved performance on benchmarks like SWE-Pro and -2, and an analysis of 138k trajectories uncovers model-specific problem-solving behaviors.