Stage-Audit: Auditable Source-Frontier Discovery for Cross-Wiki Tables
Quick Take
Stage-Audit raises source-frontier precision in cross-wiki table discovery by auditing every LLM-curated row against the source it cites.
Key Points
- Introduces disjoint curator-auditor write rights.
- Implements a row-level source-citation gate.
- Raises source-frontier precision from 0.356 to 0.505 and F1 from 0.334 to 0.451 over a vanilla LLM curator on a 51-instance evaluation set.
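One way to picture the row-level source-citation gate listed above is as a filter that admits a row only when its cited page was actually fetched and contains the evidence the curator quotes. The sketch below is a minimal illustration under that assumption, not the paper's implementation: the `Row` fields, `passes_citation_gate`, and the verbatim-substring check are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Row:
    """A curated table row (hypothetical shape, not taken from the paper)."""
    key: str
    source_page: Optional[str]  # page title the curator cites for this row
    evidence: Optional[str]     # snippet the curator claims to have read there


def passes_citation_gate(row: Row, fetched_pages: Dict[str, str]) -> bool:
    """Admit a row only if its cited page was fetched and contains the evidence.

    Assumption: a verbatim substring match stands in for whatever
    evidence check a real auditor would perform.
    """
    if not row.source_page or not row.evidence:
        return False  # no citation at all: likely recalled from memory
    page_text = fetched_pages.get(row.source_page)
    if page_text is None:
        return False  # cited page was never actually retrieved
    return row.evidence in page_text


def gate_rows(rows: List[Row], fetched_pages: Dict[str, str]) -> Tuple[List[Row], List[Row]]:
    """Split rows into (accepted, rejected) according to the gate."""
    accepted, rejected = [], []
    for row in rows:
        (accepted if passes_citation_gate(row, fetched_pages) else rejected).append(row)
    return accepted, rejected


# Toy data, invented for illustration only.
pages = {"List of rivers of Ruritania": "The Strelsau River is 120 km long."}
rows = [
    Row("Strelsau River", "List of rivers of Ruritania", "Strelsau River is 120 km long"),
    Row("Zenda River", "List of rivers of Ruritania", "Zenda River is 80 km long"),
    Row("Hentzau River", None, None),
]
accepted, rejected = gate_rows(rows, pages)
print([r.key for r in accepted])
print([r.key for r in rejected])
```

In this toy run only the first row passes: the second cites a real page that does not support it, and the third has no citation, which are the two failure modes a page-level citation alone would not catch.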
Abstract: LLM-curated tables can appear source-grounded while containing unsupported rows: the curator may recall entries from parametric memory and retroactively attach page-level citations that are not the actual source. We study this hazard in Seed2Frontier discovery: the task of finding complement Wikipedia pages from a seed page to assemble a structured table. Stage-Audit addresses it with disjoint curator-auditor write rights, a row-level source-citation gate, and a 12-check audit taxonomy over keys, schema, source roles, cardinality, and scope. On a curated 51-instance Seed2Frontier evaluation set spanning 15 top-level domains, Stage-Audit improves source-frontier precision over a vanilla LLM curator from 0.356 to 0.505 (+42% relative) and F1 from 0.334 to 0.451 (+35%), while maintaining explicit per-row source traceability. The vanilla-LLM-vs-Stage-Audit comparison isolates the policy contribution rather than LLM-based discovery in general.
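The relative gains quoted in the abstract follow directly from the reported scores, and source-frontier precision and F1 can be read as set-overlap metrics between predicted and gold complement pages. The sketch below assumes that set-based, per-instance definition (the abstract does not say how scores are aggregated across the 51 instances); `frontier_scores` and `relative_gain` are hypothetical helper names.

```python
from typing import Set, Tuple


def frontier_scores(predicted: Set[str], gold: Set[str]) -> Tuple[float, float, float]:
    """Set-based precision, recall, and F1 over source-frontier pages.

    Assumption: pages are compared by exact title; the paper may
    normalize or aggregate differently.
    """
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def relative_gain(baseline: float, improved: float) -> float:
    """Relative improvement of `improved` over `baseline`."""
    return (improved - baseline) / baseline


# Scores reported in the abstract.
print(f"precision gain: {relative_gain(0.356, 0.505):.1%}")  # 41.9%
print(f"F1 gain: {relative_gain(0.334, 0.451):.1%}")         # 35.0%

# Toy frontier with made-up page titles: 2 of 3 predictions are correct.
p, r, f1 = frontier_scores({"A", "B", "C"}, {"B", "C", "D", "E"})
print(round(p, 3), round(r, 3), round(f1, 3))
```

Rounded to whole percentages, the two gains match the abstract's +42% and +35%.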
| Comments: | 9 pages, 2 figures, 3 tables. Accepted at the ACM CAIS 2026 Workshop on AI Agents for Discovery in the Wild |
| Subjects: | Computation and Language (cs.CL) |
| ACM classes: | H.3.3; I.2.7 |
| Cite as: | arXiv:2605.20478 [cs.CL] |
| (or arXiv:2605.20478v1 [cs.CL] for this version) | |
| https://doi.org/10.48550/arXiv.2605.20478 arXiv-issued DOI via DataCite (pending registration) |
Submission history
From: Chen Shen
[v1] Tue, 19 May 2026 20:41:35 UTC (33 KB)
— Originally published at arxiv.org