LoHoSearch: Benchmarking Long-Horizon Search Agents Beyond the Human Difficulty Ceiling
Quick Answer
LoHoSearch introduces a new benchmark for long-horizon search agents, featuring 544 human-verified questions across 11 domains.
Quick Take
Built from a knowledge graph of over 7 million Wikipedia entities, LoHoSearch is hard for current systems: the best-performing model reaches only 34.74% accuracy, and existing context management strategies add at most 6.8%, far less than they gain on prior benchmarks.
Key Points
- LoHoSearch comprises 544 human-verified questions spanning 11 domains.
- The benchmark is based on a knowledge graph with over 7 million Wikipedia entities.
- The strongest evaluated model reaches only 34.74% accuracy, whereas top models exceed 90% on saturated benchmarks such as BrowseComp.
- The best existing context management strategy adds only 6.8%, a far smaller gain than on prior benchmarks.
- LoHoSearch provides a more demanding standard for evaluating long-horizon reasoning and context management in search agents.
Article Excerpt
From the source RSS summary (arXiv:2606.12837v1). Abstract: Search agent benchmarks exemplified by BrowseComp have rapidly saturated over the past year, with the strongest models surpassing 90% accuracy. Since these benchmarks are predominantly human-authored, annotators lack a global perspective on entity statistics and cannot systematically maximize search space size and structural complexity. This creates a difficulty ceiling that is hard to break.
To address this, we introduce LoHoSearch (Long-Horizon Search Agents), a challenging benchmark comprising 544 human-verified questions across 11 domains. LoHoSearch is constructed via an automated pipeline built upon a knowledge graph covering over 7 million Wikipedia entities, which selects relations with large search spaces and assembles them into structurally complex questions with KG-verified unique answers. Our evaluation demonstrates that even the strongest model achieves only 34.74% accuracy, and existing context management strategies (best +6.8%) yield far smaller gains than on prior benchmarks. LoHoSearch provides a more demanding standard for evaluating long-horizon reasoning and context management in search agents.
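The abstract does not describe the pipeline's internals, so what follows is only a minimal sketch of what "KG-verified unique answers" could mean in practice. It assumes a question is a conjunction of (relation, object) constraints over a toy triple store. The entity names, the relation names, and the helper functions `build_index`, `search_space_size`, `answers`, and `has_unique_answer` are all hypothetical and are not taken from the paper.

```python
from collections import defaultdict

# Toy triples (subject, relation, object). All names are invented for illustration.
TRIPLES = [
    ("A", "born_in", "X"),
    ("B", "born_in", "X"),
    ("C", "born_in", "X"),
    ("A", "works_in", "physics"),
    ("B", "works_in", "physics"),
    ("C", "works_in", "chemistry"),
    ("A", "won", "prize_1"),
    ("C", "won", "prize_1"),
]


def build_index(triples):
    """Map each (relation, object) pair to the set of subjects satisfying it."""
    index = defaultdict(set)
    for subj, rel, obj in triples:
        index[(rel, obj)].add(subj)
    return index


def search_space_size(index, constraint):
    """Number of candidate entities a single constraint leaves open."""
    return len(index.get(constraint, set()))


def answers(index, constraints):
    """Entities that satisfy every constraint (intersection of candidate sets)."""
    candidate_sets = [index.get(c, set()) for c in constraints]
    return set.intersection(*candidate_sets) if candidate_sets else set()


def has_unique_answer(index, constraints):
    """A question passes this check only if exactly one entity satisfies it."""
    return len(answers(index, constraints)) == 1


index = build_index(TRIPLES)
question = [("born_in", "X"), ("works_in", "physics"), ("won", "prize_1")]
```

In this toy graph each individual constraint leaves two or three candidates open, yet the three together are satisfied only by entity `A`. Dropping the `won` constraint leaves both `A` and `B`, so that shorter question would fail the uniqueness check.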