Ko-WideSearch: A Korean Breadth-Search Benchmark for Exhaustive Set Enumeration by Web Agents

6/29/2026 · ~2 min read · en

Quick Answer

Ko-WideSearch is a Korean breadth-search benchmark for exhaustive set enumeration. It finds that web agents usually recover the members of a set but often fail to fill in each member's attributes correctly.

Quick Take

Ko-WideSearch comprises 228 tables over 190 entities in sixteen categories. Across twenty web agents, set membership is recovered far more reliably than complete rows (e.g. Item-F1 92.8 against Row-F1 53.7). Agents find the items but do not get every attribute of each item right, and neither more search nor more spend closes the gap.

Key Points

Ko-WideSearch benchmarks breadth search in Korean, whereas most existing web-agent benchmarks measure depth.
The benchmark includes 228 tables across 190 entities and 16 categories.
Agents recover the set but not the rows: for example, Item-F1 of 92.8 against Row-F1 of 53.7.
Open-ended free-text cells fail most often, while cells with a standard answer such as a date or a name are usually correct.
The benchmark has three difficulty tiers set by two independent structural knobs: table width and a 2-D composite key.
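
To see how Item-F1 can be high while Row-F1 stays low, here is a minimal sketch. This summary does not give the exact metric definitions, so the code assumes that Item-F1 scores set membership by key and that Row-F1 counts a predicted row as correct only when every cell matches the gold row. The function names and the toy table are hypothetical.

```python
def f1(num_correct: int, num_pred: int, num_gold: int) -> float:
    """Harmonic mean of precision and recall; 0.0 when nothing matches."""
    if num_correct == 0:
        return 0.0
    precision = num_correct / num_pred
    recall = num_correct / num_gold
    return 2 * precision * recall / (precision + recall)


def item_f1(pred: dict, gold: dict) -> float:
    # Assumption: an item is correct if its key appears in the gold set.
    correct = len(pred.keys() & gold.keys())
    return f1(correct, len(pred), len(gold))


def row_f1(pred: dict, gold: dict) -> float:
    # Assumption: a row is correct only if every attribute matches exactly.
    correct = sum(1 for key, row in pred.items() if gold.get(key) == row)
    return f1(correct, len(pred), len(gold))


# Hypothetical toy table: item key -> attribute dict.
gold = {
    "A": {"year": "2001", "role": "host"},
    "B": {"year": "2002", "role": "guest"},
    "C": {"year": "2003", "role": "guest"},
    "D": {"year": "2004", "role": "host"},
}
pred = {
    "A": {"year": "2001", "role": "host"},
    "B": {"year": "2002", "role": "host"},   # one wrong cell
    "C": {"year": "2003", "role": "guest"},
    "D": {"year": "1999", "role": "host"},   # one wrong cell
}

print(item_f1(pred, gold))  # 1.0
print(row_f1(pred, gold))   # 0.5
```

In this toy case the agent lists all four items, yet a single wrong cell in two of the rows halves the row score. Under the all-cells-must-match assumption, wider tables give each row more chances to fail.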


[Submitted on 25 Jun 2026]


Abstract: Web-agent benchmarks overwhelmingly measure depth -- pinning one obscure answer behind a chain of constraints -- while breadth, exhaustively enumerating a closed set and filling each item's attributes, is barely evaluated, especially outside English. Breadth is also hard to build: certifying that a gold set is complete and every cell correct is far costlier than checking a single answer. I introduce Ko-WideSearch, a Korean breadth-search benchmark built by an automated synthesize-and-verify pipeline. Each task names a set-parent entity -- a TV season, a dynasty, a league, an administrative region, an election -- and asks for its full membership plus a per-item attribute table, graded by Item-, Column-, and Row-F1. It spans 228 tables over 190 entities and sixteen categories across three difficulty tiers, set by two structural knobs I dial independently -- table width and a 2-D composite key -- so cross-product membership climbs from 0% to 100% across the tiers. A single normalization-aware comparator is shared between gold construction and grading, so stable date and count columns are not over-dropped on formatting alone. Across twenty web agents, the failure is consistent: agents recover the set but not the rows (e.g. Item-F1 92.8 against Row-F1 53.7), accuracy falls steadily as the knobs harden, and neither more search nor more spend closes the gap. Broken down by cell, the hard part is finding the right value, not formatting it: open-ended free-text cells fail most, while cells with a standard answer such as a date or a name usually come out right.
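
The abstract does not spell out the comparator's rules, so the following is only a hedged illustration of what "normalization-aware" comparison of date and count cells could look like. The function names `normalize_cell` and `cells_match`, the accepted date formats, and the single-syllable counter suffix are all assumptions made for this sketch, not the paper's implementation.

```python
import re
import unicodedata


def normalize_cell(value: str) -> str:
    """Map surface variants of a date or count to one canonical string."""
    text = unicodedata.normalize("NFKC", value).strip()

    # Dates such as "2024년 3월 5일", "2024-03-05", "2024.3.5", "2024/3/5".
    m = re.fullmatch(
        r"(\d{4})\s*[년./-]\s*(\d{1,2})\s*[월./-]\s*(\d{1,2})\s*일?\.?", text
    )
    if m:
        year, month, day = (int(g) for g in m.groups())
        return f"{year:04d}-{month:02d}-{day:02d}"

    # Counts such as "1,234", "1234명", "1,234 개" (optional one-syllable counter).
    m = re.fullmatch(r"(\d{1,3}(?:,\d{3})+|\d+)\s*[가-힣]?", text)
    if m:
        return str(int(m.group(1).replace(",", "")))

    # Fallback for free text: collapse whitespace and case-fold.
    return re.sub(r"\s+", " ", text).casefold()


def cells_match(a: str, b: str) -> bool:
    # The same function would be used for gold construction and for grading.
    return normalize_cell(a) == normalize_cell(b)


print(cells_match("2024년 3월 5일", "2024-03-05"))  # True
print(cells_match("1,234명", "1234"))              # True
print(cells_match("2024-03-05", "2024-03-06"))     # False
```

Sharing one such function between gold construction and grading means a value is never judged by two different formatting conventions, which is the property the abstract attributes to its comparator.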

Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2606.27595 [cs.CL]
	(or arXiv:2606.27595v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2606.27595 arXiv-issued DOI via DataCite

Submission history

From: Minbyul Jeong
[v1] Thu, 25 Jun 2026 22:51:59 UTC (818 KB)

Originally published at arxiv.org


