AI search agents often confirm what they already know instead of actually researching the web

The Decoder·Jonathan Kemper

5/31/2026


Quick Answer

AI search agents like GPT-5.4 and Kimi K2.6 mostly use the web to confirm what they learned during training rather than to research new information.

Quick Take

A study from the Harbin Institute of Technology, using the new LiveBrowseComp benchmark, finds that AI search agents like GPT-5.4 and Kimi K2.6 perform significantly worse when questions cover only events from the last 90 days. The drop indicates that the models rely on training data rather than on current information from the web.

Key Points

GPT-5.4 and Kimi K2.6 struggle with real-time information retrieval.
LiveBrowseComp benchmark tests performance on events from the last 90 days.
Models' performance drops significantly when they can't rely on training data.
Research highlights limitations in AI search agents' web research capabilities.
Existing model rankings get reshuffled when the models are tested on recent events.

Article Excerpt

From source RSS / original summary

Leading AI search agents like GPT-5.4 and Kimi K2.6 don't appear to do much actual research on established benchmarks. They mostly just use the web to confirm what they already learned during training. Researchers at the Harbin Institute of Technology found this using a new time-based benchmark called LiveBrowseComp, which only asks about events from the last 90 days. Once the models can't fall back on memory, performance falls apart and the existing rankings get reshuffled.

The article AI search agents often confirm what they already know instead of actually researching the web appeared first on The Decoder.

Read on the-decoder.com

