Ko-WideSearch: A Korean Breadth-Search Benchmark for Exhaustive Set Enumeration by Web Agents
Quick Answer
Ko-WideSearch is a Korean breadth-search benchmark for exhaustive set enumeration. It shows that web agents usually recover the members of a set but often fail to fill in each member's attributes correctly.
Quick Take
The benchmark comprises 228 tables over 190 entities. Across the twenty web agents evaluated, set recovery far outpaces row accuracy (e.g. Item-F1 of 92.8 against Row-F1 of 53.7): agents find the members of a set but do not retrieve a fully correct attribute row for each one.
Key Points
- Ko-WideSearch benchmarks breadth search in Korean, contrasting with depth-focused existing benchmarks.
- The benchmark includes 228 tables across 190 entities and 16 categories.
- Agents recover sets far better than rows: e.g. Item-F1 of 92.8 against Row-F1 of 53.7.
- Challenges arise mainly in retrieving open-ended free-text cell values.
- The benchmark is structured into three difficulty tiers based on table width and composite keys.
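The gap between Item-F1 and Row-F1 is easier to see on a toy table. The sketch below is only an illustration under stated assumptions: the paper's exact metric definitions are not reproduced here, so it assumes an item is correct when its key is in the gold set and a row is correct only when every cell matches exactly. The function names and the three-row table are invented for the example.

```python
from typing import Dict

# Assumed representation: item key -> {column name: cell value}
Table = Dict[str, Dict[str, str]]


def f1(num_correct: int, num_predicted: int, num_gold: int) -> float:
    """Harmonic mean of precision and recall; 0.0 when nothing matches."""
    if num_correct == 0 or num_predicted == 0 or num_gold == 0:
        return 0.0
    precision = num_correct / num_predicted
    recall = num_correct / num_gold
    return 2 * precision * recall / (precision + recall)


def item_f1(pred: Table, gold: Table) -> float:
    """Assumption: an item is correct if its key appears in the gold set."""
    matched = len(pred.keys() & gold.keys())
    return f1(matched, len(pred), len(gold))


def row_f1(pred: Table, gold: Table) -> float:
    """Assumption: a row is correct only if the key and every cell match."""
    matched = sum(1 for key, row in pred.items() if gold.get(key) == row)
    return f1(matched, len(pred), len(gold))


# Invented toy data: all three items are found, but two rows have a wrong cell.
gold = {
    "Team A": {"founded": "1982", "home_city": "Seoul"},
    "Team B": {"founded": "1990", "home_city": "Busan"},
    "Team C": {"founded": "2000", "home_city": "Daegu"},
}
pred = {
    "Team A": {"founded": "1982", "home_city": "Seoul"},
    "Team B": {"founded": "1990", "home_city": "Incheon"},
    "Team C": {"founded": "2001", "home_city": "Daegu"},
}

print(f"Item-F1: {item_f1(pred, gold):.3f}")
print(f"Row-F1:  {row_f1(pred, gold):.3f}")
```

Here the predicted set is complete, so Item-F1 is perfect, while a single wrong cell in each of two rows pulls Row-F1 down to one third. A row-level metric punishes any cell error, which is why wider tables would be expected to widen the gap.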
Abstract: Web-agent benchmarks overwhelmingly measure depth -- pinning one obscure answer behind a chain of constraints -- while breadth, exhaustively enumerating a closed set and filling each item's attributes, is barely evaluated, especially outside English. Breadth is also hard to build: certifying that a gold set is complete and every cell correct is far costlier than checking a single answer. I introduce Ko-WideSearch, a Korean breadth-search benchmark built by an automated synthesize-and-verify pipeline. Each task names a set-parent entity -- a TV season, a dynasty, a league, an administrative region, an election -- and asks for its full membership plus a per-item attribute table, graded by Item-, Column-, and Row-F1. It spans 228 tables over 190 entities and sixteen categories across three difficulty tiers, set by two structural knobs I dial independently -- table width and a 2-D composite key -- so cross-product membership climbs from 0% to 100% across the tiers. A single normalization-aware comparator is shared between gold construction and grading, so stable date and count columns are not over-dropped on formatting alone. Across twenty web agents, the failure is consistent: agents recover the set but not the rows (e.g. Item-F1 92.8 against Row-F1 53.7), accuracy falls steadily as the knobs harden, and neither more search nor more spend closes the gap. Broken down by cell, the hard part is finding the right value, not formatting it: open-ended free-text cells fail most, while cells with a standard answer such as a date or a name usually come out right.
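The abstract does not spell out how the normalization-aware comparator works, so the following is a hypothetical sketch of the general idea rather than the paper's implementation: date and count cells are mapped to a canonical form before comparison, so that formatting differences alone do not count as errors. The function names `normalize_cell` and `cells_match` and the specific rules are assumptions.

```python
import re
import unicodedata


def normalize_cell(value: str) -> str:
    """Map surface variants of a date or count to one canonical string (assumed rules)."""
    text = unicodedata.normalize("NFKC", value).strip()

    # Dates: "2024년 3월 5일", "2024.03.05", "2024-3-5" -> "2024-03-05"
    date = re.fullmatch(
        r"(\d{4})\s*[년./-]\s*(\d{1,2})\s*[월./-]\s*(\d{1,2})\s*일?\.?", text
    )
    if date:
        year, month, day = (int(part) for part in date.groups())
        return f"{year:04d}-{month:02d}-{day:02d}"

    # Counts: "1,234", "1234명", "1,234 개" -> "1234"
    count = re.fullmatch(r"(\d{1,3}(?:,\d{3})+|\d+)\s*[가-힣]?", text)
    if count:
        return count.group(1).replace(",", "")

    # Fallback for free text: collapse whitespace and ignore case.
    return re.sub(r"\s+", " ", text).casefold()


def cells_match(predicted: str, gold: str) -> bool:
    """Compare two cells after normalization; the same function can serve gold building and grading."""
    return normalize_cell(predicted) == normalize_cell(gold)


print(cells_match("2024년 3월 5일", "2024-03-05"))
print(cells_match("1,234명", "1234"))
print(cells_match("2024-03-05", "2024-03-06"))
```

Using one function on both sides, when building the gold tables and when grading, means a column is neither kept nor scored under two different notions of equality. That is the property the abstract attributes to the shared comparator.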
Subjects: Computation and Language (cs.CL)
Cite as: arXiv:2606.27595 [cs.CL] (or arXiv:2606.27595v1 [cs.CL] for this version)
DOI: https://doi.org/10.48550/arXiv.2606.27595 (arXiv-issued DOI via DataCite)
Submission history
From: Minbyul Jeong
[v1] Thu, 25 Jun 2026 22:51:59 UTC (818 KB)
— Originally published at arxiv.org