SANA: What Matters for QA Agents over Massive Data Lakes?
Quick Answer
The SANA framework diagnoses where exploratory question answering (EQA) agents over data lakes fail, separating failures in search, planning, data analysis, and action policy.
Quick Take
SANA is a diagnostic ablation framework for exploratory question answering (EQA) over data lakes: it replaces an agent's search, planning, and data-analysis components with idealized tools one at a time to see where failures originate. Evaluations on LakeQA and KramaBench show that data analysis is a consistent bottleneck and planning less so, while search is a major limitation in LakeQA's large data lake but less so in the smaller-scale KramaBench. SANA thereby supports systematic comparisons of search, planning, data analysis, and agent design.
Key Points
- SANA transforms EQA tasks into runtime profiles for diagnostic purposes.
- Data analysis consistently emerges as a bottleneck across benchmarks.
- Search limitations are pronounced in LakeQA but less so in KramaBench.
- SANA allows for systematic comparisons of agent design and performance.
- End-to-end accuracy alone fails to pinpoint specific agent failures.
Article Content
From source RSS / original summary
arXiv:2606.13904v1 Announce Type: new
Abstract: Exploratory question answering (EQA) over data lakes requires an LLM agent to discover relevant sources, analyze retrieved data, and adapt its actions based on intermediate results. End-to-end accuracy alone cannot distinguish failures in search, planning, data analysis, or the agent's Action Policy: its decisions about what to do next and when to submit an answer.
We present SANA (Search Agent Navigation Ablation framework), a diagnostic ablation framework that transforms EQA tasks into runtime profiles containing gold source sequence, sanitized subquestions, and execution records. SANA uses these profiles to construct idealized search, planning, and data-analysis tools, allowing each component to be ablated; the residual gap is diagnostic evidence for policy failures.
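As a rough illustration of the idea described above, the following Python sketch shows one way a runtime profile and the idealized tools built from it might look. The abstract does not specify SANA's data format or API, so every class, field, and function name here (RuntimeProfile, ExecutionRecord, oracle_search, oracle_plan, oracle_analyze) is a hypothetical stand-in, not the authors' implementation.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ExecutionRecord:
    """Hypothetical record of one analysis step and its recorded result."""
    subquestion: str
    result: str


@dataclass
class RuntimeProfile:
    """Hypothetical container for the three profile parts named in the abstract."""
    question: str
    gold_sources: list[str]    # gold source sequence, in order of use
    subquestions: list[str]    # sanitized subquestions, in order
    records: list[ExecutionRecord] = field(default_factory=list)


def oracle_search(profile: RuntimeProfile, step: int) -> Optional[str]:
    """Idealized search: return the gold source for this step, or None if out of range."""
    if 0 <= step < len(profile.gold_sources):
        return profile.gold_sources[step]
    return None


def oracle_plan(profile: RuntimeProfile, step: int) -> Optional[str]:
    """Idealized planning: return the sanitized subquestion for this step."""
    if 0 <= step < len(profile.subquestions):
        return profile.subquestions[step]
    return None


def oracle_analyze(profile: RuntimeProfile, subquestion: str) -> Optional[str]:
    """Idealized analysis: look up the recorded result for a subquestion."""
    for record in profile.records:
        if record.subquestion == subquestion:
            return record.result
    return None
```

In a sketch like this, ablating a component means swapping the agent's own tool for the matching oracle function while leaving the other tools, the prompt, and the budget unchanged.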
To illustrate SANA as a reusable evaluation framework, we adapted two recent EQA benchmarks, LakeQA and KramaBench, and evaluated lightweight and mid-sized agents under fixed prompts, budgets, data lakes, and runtimes. Across both benchmarks, data analysis is a consistent bottleneck while planning is less so. Search is a major limitation in LakeQA's large data-lake setting, but less so for the smaller-scale KramaBench.
SANA thus deconstructs end-to-end task accuracies into a diagnosis of where data-lake agents fail, and allows for systematic comparisons of progress in search, planning, data analysis, and agent design.
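The decomposition of end-to-end accuracy can be sketched as simple arithmetic over per-configuration accuracies. This is one assumed reading of the abstract, not a formula it states: each component's gain is taken as the accuracy with that component idealized minus the baseline, and the residual policy gap as the shortfall from perfect accuracy when all three tools are idealized. The function name, the configuration keys, and the numbers in the usage example are hypothetical placeholders, not results from the paper.

```python
from __future__ import annotations


def ablation_report(accuracy: dict[str, float]) -> dict[str, float]:
    """Turn per-configuration accuracies into per-component gains and a residual gap.

    Expects the keys "baseline" and "all_oracle"; every other key is treated as
    a run with exactly one component idealized. (Assumed layout, not from the paper.)
    """
    baseline = accuracy["baseline"]
    report = {
        f"gain_{name}": round(acc - baseline, 4)
        for name, acc in accuracy.items()
        if name not in ("baseline", "all_oracle")
    }
    # What is still wrong with every tool idealized is attributed to the policy.
    report["residual_policy_gap"] = round(1.0 - accuracy["all_oracle"], 4)
    return report


# Placeholder accuracies for illustration only; NOT results from the paper.
example = {
    "baseline": 0.30,
    "oracle_search": 0.45,
    "oracle_planning": 0.33,
    "oracle_analysis": 0.55,
    "all_oracle": 0.80,
}
report = ablation_report(example)
```

With these placeholder inputs, the largest gain falls on the analysis component, which is the kind of pattern the abstract describes as a data-analysis bottleneck.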