LoHoSearch: Benchmarking Long-Horizon Search Agents Beyond the Human Difficulty Ceiling

arXiv cs.CL·Jiarui Zhao, Rongzhi Zhang, Lingchuan Liu, Hao Yang, Xunliang Cai, Xi Su

6/12/2026

Quick Answer

LoHoSearch is a new benchmark for long-horizon search agents, with 544 human-verified questions across 11 domains.

Quick Take

Human-authored search benchmarks such as BrowseComp have saturated, with the strongest models surpassing 90% accuracy. On LoHoSearch, the strongest model evaluated reaches only 34.74% accuracy, and existing context management strategies (best +6.8%) yield far smaller gains than they do on prior benchmarks.

Key Points

LoHoSearch comprises 544 human-verified questions across 11 domains.
An automated pipeline builds the questions from a knowledge graph covering over 7 million Wikipedia entities, selecting relations with large search spaces and checking that each answer is unique in the graph.
The strongest model evaluated achieves only 34.74% accuracy.
Existing context management strategies add at most 6.8% on LoHoSearch, far smaller gains than on prior benchmarks.
LoHoSearch provides a more demanding standard for evaluating long-horizon reasoning and context management in search agents.


Article Excerpt

From source RSS / original summary

arXiv:2606.12837v1

Abstract: Search agent benchmarks exemplified by BrowseComp have rapidly saturated over the past year, with the strongest models surpassing 90% accuracy. Since these benchmarks are predominantly human-authored, annotators lack a global perspective on entity statistics and cannot systematically maximize search space size and structural complexity. This creates a difficulty ceiling that is hard to break.

To address this, we introduce LoHoSearch (Long-Horizon Search Agents), a challenging benchmark comprising 544 human-verified questions across 11 domains. LoHoSearch is constructed via an automated pipeline built upon a knowledge graph covering over 7 million Wikipedia entities, which selects relations with large search spaces and assembles them into structurally complex questions with KG-verified unique answers. Our evaluation demonstrates that even the strongest model achieves only 34.74% accuracy, and existing context management strategies (best +6.8%) yield far smaller gains than on prior benchmarks. LoHoSearch provides a more demanding standard for evaluating long-horizon reasoning and context management in search agents.
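The abstract's idea of "KG-verified unique answers" and "relations with large search spaces" can be illustrated with a toy sketch. This is not the authors' pipeline: the triples, relation names, and helper functions below are all hypothetical, and the real system works over a graph of millions of Wikipedia entities. The sketch only shows the general pattern of intersecting constraints over a knowledge graph and accepting a question when exactly one entity remains.

```python
from collections import defaultdict

# Toy knowledge graph as (subject, relation, object) triples.
# All entity and relation names are hypothetical placeholders.
TRIPLES = [
    ("A", "born_in", "X"),
    ("B", "born_in", "X"),
    ("C", "born_in", "Y"),
    ("A", "award", "P"),
    ("B", "award", "Q"),
    ("C", "award", "P"),
]


def build_index(triples):
    """Map each (relation, object) constraint to the set of matching subjects."""
    index = defaultdict(set)
    for subject, relation, obj in triples:
        index[(relation, obj)].add(subject)
    return index


def search_space_size(index, constraint):
    """Number of entities satisfying a single constraint (its fan-out)."""
    return len(index.get(constraint, set()))


def candidates(index, constraints):
    """Entities that satisfy every constraint in the list."""
    sets = [index.get(c, set()) for c in constraints]
    return set.intersection(*sets) if sets else set()


def has_unique_answer(index, constraints):
    """True when exactly one entity in the graph satisfies all constraints."""
    return len(candidates(index, constraints)) == 1


INDEX = build_index(TRIPLES)
```

In this toy graph, the constraint ("born_in", "X") alone matches two entities, so it would not make a valid question by itself. Combining it with ("award", "P") leaves a single entity, which is the kind of uniqueness check the abstract describes at a much larger scale.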

