UK's AI Security Institute finds standard benchmarks systematically underestimate what AI agents can actually do

The Decoder·Matthias Bastian

7/3/2026 · ~1 min read

Quick Take

The UK's AI Security Institute (AISI) finds that standard benchmarks underestimate AI agent capabilities because they cap the compute budget. On software engineering tasks, success rates rose about 25 percent when the token budget was increased tenfold. Depending on the token budget, progress at the frontier is about 60 percent steeper than previous measurements suggested, which points to a need for revised evaluation methods.

Key Points

Standard benchmarks cap the compute budget, which leads them to underestimate what AI agents can do.
On software engineering tasks, success rates rose about 25 percent when the token budget was increased tenfold.
Newer models benefit the most from larger token budgets.
Depending on the token budget, actual progress at the frontier is about 60 percent steeper than previous measurements suggested.
The findings suggest that fixed-budget evaluation methods need revisiting.

Article Excerpt

From source RSS / original summary

In a study covering seven benchmarks, the UK's AI Security Institute shows that standard AI evaluations systematically underestimate agent capabilities by capping the compute budget. On software engineering tasks, success rates jumped about 25 percent when the token budget was increased tenfold. Newer models benefit the most. Depending on the token budget, actual progress at the frontier is about 60 percent steeper than previous measurements suggested, according to AISI.
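The mechanism described above, a budget cap hiding tasks an agent could solve with more tokens, can be illustrated with a minimal Python sketch. Every number below is invented for illustration and is not AISI data, and the function `success_rate` is a hypothetical helper, not AISI's methodology. It assumes each task has a fixed token requirement, which is a simplification.

```python
def success_rate(tokens_needed, budget):
    """Fraction of tasks whose (hypothetical) token requirement fits in the budget."""
    solved = sum(1 for t in tokens_needed if t is not None and t <= budget)
    return solved / len(tokens_needed)


# Hypothetical per-task token requirements (made-up values, not AISI data).
# None marks a task the agent never solves, whatever the budget.
tokens_needed = [20_000, 60_000, 90_000, 150_000, 400_000, 800_000, None, None]

capped = success_rate(tokens_needed, budget=100_000)    # typical fixed cap (assumed)
raised = success_rate(tokens_needed, budget=1_000_000)  # tenfold larger budget

print(f"capped budget: {capped:.3f}")
print(f"tenfold budget: {raised:.3f}")
```

In this toy setup, the capped evaluation counts only the three cheapest tasks as solved, while the larger budget also captures the three tasks that simply needed more tokens. The two tasks marked `None` stay unsolved either way, so a larger budget changes the measured rate without changing what the agent can fundamentally do.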

More from The Decoder

An AI model programmed nonstop for 19 days on a single MirrorCode task that cost $2,600 to run

The Decoder·Matthias Bastian

1w ago

AI Summary

Epoch AI's MirrorCode benchmark puts Claude Opus 4.7 in the lead with a 56% solve rate; it reconstructed a 16,000-line toolkit in 14 hours. Even so, all models tested struggle with the most complex tasks, which points to limits in current AI capabilities. One task ran for 19 days and cost $2,600, raising questions about cost-effectiveness in AI development.
