AI search agents don't fail at searching; they fail at asking the right questions when queries get ambiguous

The Decoder·Jonathan Kemper

7/5/2026

Quick Take

AI search agents struggle with ambiguous queries, failing to ask clarifying questions, as shown by Tencent's DiscoBench. Eleven tested models, including Doubao Seed 2.0 Pro and Gemini 3.1 Pro, achieved under 50% accuracy, highlighting the need for improved ambiguity detection and user interaction.

Key Points

DiscoBench evaluates AI's ability to handle ambiguity in search queries.
Top model Doubao Seed 2.0 Pro achieved only 43.1% accuracy.
Models that ask clarifying questions significantly outperform those that guess.
Ambiguity types include entity mismatches, version or time-period conflicts, criteria confusion, and factual errors.
Guided prompts sharply improved ambiguity detection but raised end-to-end accuracy only modestly.

AI search agents rarely fail at multi-step research tasks because of the search itself. Their real problem is failing to ask the user for clarification when queries are ambiguous. That's the finding of a new benchmark from a team at Tencent Hunyuan and Tsinghua University. Agents that search repeatedly without asking often perform worse than agents that simply guess.

With DiscoBench, the researchers built a test framework that checks whether language models can spot ambiguity on their own during deep search chains, ask targeted follow-up questions, and correct their research path. Previous benchmarks like GAIA or BrowseComp assume user queries are complete and unambiguous.

But real-world queries are often vague, incomplete, or flat-out wrong. In long reasoning chains, every unresolved ambiguity compounds and steers the agent down the wrong path. If the model picks the wrong entity at an early node, it keeps searching with clean syntax but misses the actual target entirely.

Flowchart showing two search paths of an agent handling an ambiguous multi-step query: on the left, the correct path with follow-up questions to the user; on the right, the wrong path with direct guessing, which propagates errors across checkpoints CP1 to CP3. When a search agent guesses instead of clarifying ambiguities, the error cascades through the entire reasoning chain and produces a wrong final answer. | Image: Cheng et al.

Four types of ambiguity

DiscoBench contains 211 tasks with a total of 463 ambiguous points across eleven knowledge domains, including video games, sports, music, film, science, and politics. Each task is split into multiple checkpoints. At each checkpoint, the agent can choose between three actions: keep searching, ask the user for clarification, or give an answer.
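
The three-way choice at each checkpoint can be pictured as a small control loop. The following is a minimal sketch, not the benchmark's actual harness: the names `Action`, `run_checkpoint`, and `toy_policy`, the callback signatures, and the step budget are all hypothetical and chosen only for illustration.

```python
from enum import Enum
from typing import Callable, List, Tuple


class Action(Enum):
    SEARCH = "search"   # issue another search query
    ASK = "ask"         # ask the user a clarifying question
    ANSWER = "answer"   # commit to an answer for this checkpoint


# A policy maps the history of (action, observation) pairs to the next move.
Policy = Callable[[List[Tuple[Action, str]]], Tuple[Action, str]]


def run_checkpoint(policy: Policy,
                   search: Callable[[str], str],
                   ask_user: Callable[[str], str],
                   max_steps: int = 8) -> Tuple[str, List[Action]]:
    """Run one checkpoint until the policy answers or the step budget runs out."""
    history: List[Tuple[Action, str]] = []
    for _ in range(max_steps):
        action, payload = policy(history)
        if action is Action.ANSWER:
            history.append((action, payload))
            return payload, [a for a, _ in history]
        if action is Action.SEARCH:
            observation = search(payload)
        else:
            observation = ask_user(payload)
        history.append((action, observation))
    return "", [a for a, _ in history]  # budget exhausted without an answer


def toy_policy(history: List[Tuple[Action, str]]) -> Tuple[Action, str]:
    """Hypothetical policy: search once, ask once, then answer with the user's reply."""
    if not history:
        return Action.SEARCH, "best-selling game of the series"
    if len(history) == 1:
        return Action.ASK, "Which release year do you mean?"
    return Action.ANSWER, history[-1][1]


answer, trace = run_checkpoint(
    toy_policy,
    search=lambda query: "two candidates found",
    ask_user=lambda question: "the 2017 release",
)
```

In this toy run the trace is search, ask, answer. A real agent would replace `toy_policy` with a model call, and the stubbed `search` and `ask_user` callbacks with a search engine and a user (or user simulator).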

Schematic of DiscoBench's interactive retrieval framework, branching into unambiguous and ambiguous checkpoints with a follow-up question to the user, and on the right four metric groups for task success, ambiguity detection, interaction strategy, and cost efficiency. The framework checks whether the search is unambiguous at each checkpoint and evaluates agents across four metric groups, from task success to cost efficiency. | Image: Cheng et al.

The researchers define four types of ambiguity. A description might match multiple entities, apply to different time periods or versions, allow for multiple valid ranking or evaluation criteria, or contain an outright factual error. The dataset is mostly written in Chinese to reflect typical search patterns on the Chinese-language web.

When the agent asks a useful follow-up question, an LLM-based user simulator releases a predefined clue that helps narrow the search. All search queries run through the agent search engine Tavily, and Gemini 3 Flash serves as the simulator.
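
The clue-release mechanism can be approximated with a toy stand-in. This sketch is an assumption-heavy simplification: the benchmark uses an LLM to judge whether a question is useful, whereas the hypothetical `ClueSimulator` below just matches keywords, and the example clue and trigger words are invented.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class ClueSimulator:
    """Toy stand-in for an LLM-based user simulator.

    Assumption: each ambiguous point has one predefined clue, and a question
    counts as "useful" if it mentions any of that point's trigger keywords.
    """
    clues: Dict[str, str]            # ambiguity id -> predefined clue
    triggers: Dict[str, Set[str]]    # ambiguity id -> trigger keywords
    released: Set[str] = field(default_factory=set)

    def answer(self, question: str) -> str:
        words = set(question.lower().replace("?", " ").split())
        for amb_id, keywords in self.triggers.items():
            # Release each clue at most once, and only for a matching question.
            if amb_id not in self.released and words & keywords:
                self.released.add(amb_id)
                return self.clues[amb_id]
        return "I'm not sure what you mean."


sim = ClueSimulator(
    clues={"version": "I mean the 2017 release."},
    triggers={"version": {"year", "version", "release"}},
)
useful_reply = sim.answer("Which release year do you mean?")
useless_reply = sim.answer("What is your favorite color?")
```

Here the first question touches the ambiguous point and unlocks its clue, while the second returns a non-answer. The design point is that the clue only appears if the agent asks about the right thing.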

Two-stage construction pipeline of DiscoBench: on the left, phase 1 with collection and construction of the multi-hop seed questions; on the right, phase 2 with ambiguity injection, generation of distinguishing facts, and quality control. The pipeline first builds clean multi-hop questions in phase one, then injects targeted ambiguities and distinguishing clues in phase two. | Image: Cheng et al.

Even large models stay below 50 percent

The team tested eleven models released in the past six months: Claude Opus 4.7, GPT 5.4, Gemini 3.1 Pro Preview, Doubao Seed 2.0 Pro, DeepSeek V4 Pro, Kimi K2.6, GLM 5.1, Qwen3.6 Max, MiniMax M2.7, MiMo v2.5 Pro, and Hunyuan 3.0 Preview.

Without an explicit hint about possible ambiguity, Doubao Seed 2.0 Pro hit the highest end-to-end accuracy at 43.1 percent. Gemini 3.1 Pro followed at 40.8 percent, Claude Opus 4.7 at 39.8 percent. Weaker models like MiniMax M2.7 and Qwen3.6 Max managed only 16.1 and 12.3 percent, respectively.

Scatter plot with average tool calls per question on the x-axis and accuracy in percent on the y-axis; each point represents a model such as Gemini-3.1-Pro, Claude-Opus-4.7, or Doubao-Seed-2.0-Pro. More search calls don't lead to better accuracy. Claude Opus 4.7 searches frequently but still trails Gemini 3.1 Pro and Seed 2.0 Pro. | Image: Cheng et al.

There's a gap between individual step scores and overall results. Claude Opus 4.7, for example, solves 57 percent of checkpoints correctly but only reaches 39.8 percent end-to-end. The individual research steps work fine on their own, but a single unresolved ambiguity is enough to collapse the entire chain.
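
The mechanism behind that gap can be illustrated with a simple calculation. The sketch below assumes that a task succeeds only if every checkpoint passes and that checkpoints fail independently. Both are simplifying assumptions that real tasks likely violate, so this shows the compounding effect rather than reconstructing the reported numbers. The function name `chain_success` is hypothetical.

```python
def chain_success(p_checkpoint: float, n_checkpoints: int) -> float:
    """End-to-end success if every checkpoint must pass, assuming independence."""
    return p_checkpoint ** n_checkpoints


# With a 57 percent per-checkpoint pass rate, success shrinks quickly
# as the chain gets longer.
for n in (1, 2, 3, 4):
    print(n, round(chain_success(0.57, n), 3))
```

Under these assumptions, two checkpoints already push the success rate to roughly a third and three push it below a fifth, which is why a respectable per-step score can coexist with a weak end-to-end result.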

A warning prompt isn't enough

The authors also tested what happens when the system prompt explicitly tells the agent to watch for ambiguity and ask a follow-up question when in doubt. This "Guided" mode was meant to show the ceiling that's reachable when models don't have to figure out on their own that a question is underspecified.

Averaged across ten models, end-to-end accuracy rose from 28.6 to 33.7 percent. Detection F1 jumped much more sharply, from 45.3 to 64.9 percent. The hint mostly helped models spot ambiguity without actually helping them finish the research successfully. For Claude Opus 4.7, end-to-end accuracy even dipped slightly under the guided prompt, despite a higher checkpoint pass rate.
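
The size of the mismatch is easier to see side by side. This short calculation uses only the averages quoted above; the helper name `gain` is hypothetical.

```python
from typing import Tuple


def gain(before: float, after: float) -> Tuple[float, float]:
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute = after - before
    return absolute, absolute / before * 100


accuracy_gain = gain(28.6, 33.7)   # end-to-end accuracy, neutral vs. guided
f1_gain = gain(45.3, 64.9)         # Detection F1, neutral vs. guided

print(f"accuracy: +{accuracy_gain[0]:.1f} points ({accuracy_gain[1]:.1f}% relative)")
print(f"detection F1: +{f1_gain[0]:.1f} points ({f1_gain[1]:.1f}% relative)")
```

Accuracy gains about 5.1 points while Detection F1 gains about 19.6, so the hint improves detection by nearly four times as many points as it improves task completion.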

Searching more is worse than guessing

The behavioral profile analysis breaks down what agents actually do at ambiguous checkpoints. Models that search first and then ask a follow-up ("SearchThenAsk") average a 93.4 percent success rate. Guessing without asking ("DirectGuess") drops to 56.5 percent. Models that search repeatedly but still guess instead of asking ("SearchHeavyGuess") do even worse at 51.9 percent. According to the authors, the repeated searches suggest the model already spotted the ambiguity but never turned it into a user interaction.
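
A rough way to picture these profiles is as labels assigned to an agent's action trace at an ambiguous checkpoint. The classifier below is a sketch under stated assumptions: the article does not give the authors' exact rules, so the threshold of three searches for "heavy" and the catch-all `OtherAsk` label are invented for illustration.

```python
from typing import List


def classify_profile(actions: List[str], heavy_threshold: int = 3) -> str:
    """Label a trace of 'search' / 'ask' / 'answer' actions at an ambiguous checkpoint.

    Assumptions (not from the paper): a trace with at least `heavy_threshold`
    searches and no question counts as SearchHeavyGuess, and any other trace
    without a question counts as DirectGuess.
    """
    if "ask" in actions:
        first_ask = actions.index("ask")
        if "search" in actions[:first_ask]:
            return "SearchThenAsk"
        return "OtherAsk"  # asked without searching first; hypothetical label
    if actions.count("search") >= heavy_threshold:
        return "SearchHeavyGuess"
    return "DirectGuess"


profiles = [
    classify_profile(["search", "ask", "answer"]),
    classify_profile(["answer"]),
    classify_profile(["search", "search", "search", "search", "answer"]),
]
```

The three example traces map to SearchThenAsk, DirectGuess, and SearchHeavyGuess respectively. Grouping success rates by such labels is what lets the analysis separate "searched and asked" from "searched a lot and guessed anyway".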

That pattern also explains why more tool calls don't lead to better results. Claude Opus 4.7 searches more often than most other models but still trails Gemini 3.1 Pro and Doubao Seed 2.0 Pro in accuracy. Searching harder doesn't help if the agent never asks the right question.

Spotting ambiguity and asking good questions are two different skills

Detection ability and question quality don't track together. Qwen3.6 Max only reaches a Detection F1 of 16 percent and asks an average of 0.07 follow-up questions per task in the neutral setting. When it does ask, though, 94.7 percent of its questions are factually correct and 89.5 percent lead to progress. MiniMax M2.7 asks far more often but only achieves a follow-through rate of 60.7 to 66.5 percent.
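
A model that rarely asks can have excellent question quality and still score poorly on detection, because F1 punishes missed ambiguities. The sketch below uses the standard F1 formula. How the benchmark counts true and false positives is an assumption here (asking at an ambiguous point is a hit, asking at a clear one is a false alarm, staying silent at an ambiguous one is a miss), and the example counts are invented, not taken from the paper.

```python
def detection_f1(true_pos: int, false_pos: int, false_neg: int) -> float:
    """Standard F1: harmonic mean of precision and recall."""
    if true_pos == 0:
        return 0.0
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)


# Hypothetical cautious model: flags 10 of 100 ambiguous points, no false alarms.
cautious = detection_f1(true_pos=10, false_pos=0, false_neg=90)

# Hypothetical eager model: flags 80 of 100, but with 40 false alarms.
eager = detection_f1(true_pos=80, false_pos=40, false_neg=20)
```

In this illustration the cautious model is never wrong when it flags something, yet its F1 stays below 0.2 because it misses nine out of ten ambiguities. The eager model scores far higher despite its false alarms.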

A useful research agent needs both skills: recognizing when to ask a follow-up question and framing it so the answer actually moves the search forward.

Grouped bar chart of the detection rate of eleven models across the four ambiguity types Entity, Factual Inaccuracy, Version, and Criteria, with low values overall. Models detect factual errors most easily, while entity and criteria ambiguities are much harder to spot. | Image: Cheng et al.

Broken down by ambiguity type, factual errors are easiest to detect because they create direct contradictions during research. Entity and criteria ambiguities are harder because multiple plausible candidates or unclear evaluation standards can coexist without any obvious contradiction.

AI agents need better follow-up strategies

Without access to search tools, the tested models collapse. Doubao Seed 2.0 Pro drops from 43.1 to 2.4 percent, Gemini 3.1 Pro from 40.8 to 19.9 percent. DiscoBench can't be solved from stored model knowledge alone. At the same time, models perform much better when ambiguity is removed from the questions, with accuracy jumping by 26.8 to 40.2 points depending on the model. The authors conclude that future search agents need mechanisms that turn search uncertainty into user interaction, on top of their retrieval and reasoning abilities.

Other recent work confirms that current search agents have basic weaknesses in how they research. One study found that leading models on benchmarks like BrowseComp often just confirm what they already know. On the purpose-built LiveBrowseComp with facts beyond the knowledge cutoff, all systems dropped by 25 to 40 points. The Halluhard benchmark also showed that Claude Opus 4.5 with web search hallucinates in about 30 percent of cases, mainly when verifying the content of cited sources.

Anthropic tackled this problem in its latest model update, Claude Opus 4.8. The model is supposed to flag uncertainties more often and leaves bugs in its own code uncommented about four times less frequently than its predecessor. Perplexity is trying a different approach with Search as Code, letting models write their search workflows as Python programs instead of calling a prebuilt API.

— Originally published at the-decoder.com
