Time to REFLECT: Can We Trust LLM Judges for Evidence-based Research Agents?
Quick Take
The REFLECT benchmark tests how reliably LLM judges detect failures in deep research agents, and finds current judges unreliable.
Key Points
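- Even the best-performing LLM judges score below 55% overall accuracy across reasoning, tool-use, and report-quality failures.
- Existing meta-evaluations rely on coarse, subjective human-preference agreement and focus on instruction-following or verifiable tasks.
- REFLECT is a meta-evaluation benchmark for fine-grained failure detection, built from controlled interventions on agent execution traces.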
Abstract: Deep research agents increasingly automate complex information-seeking tasks, producing evidence-grounded reports via multi-step reasoning, tool use, and synthesis. Their growing role demands scalable, reliable evaluation, positioning LLM-as-judge as a supervision paradigm for assessing factual accuracy, evidence use, and reasoning quality. Yet the reliability of these judges for deep research agents remains poorly understood, posing a critical meta-evaluation problem: before deploying LLM judges to supervise research agents, we must first evaluate the judges themselves. Existing meta-evaluations fall short in two ways: (1) reliance on coarse, subjective human-preference agreement; (2) focus on instruction-following or verifiable tasks, leaving open-ended agent executions unexplored. To address these gaps, we introduce REFLECT (REliable Fine-grained LLM judge Evaluation via Controlled inTervention), a meta-evaluation benchmark targeting fine-grained failure detection in agentic environments. REFLECT defines a detailed taxonomy of process- and outcome-level failure modes, instantiated by performing controlled and localized interventions on quality-screened agent execution traces. This yields verifiable, comprehensive, and fine-grained instances for validating the judge models. Our experiments show that current LLM judges remain unreliable: even the best-performing models achieve overall accuracies below 55% across reasoning, tool-use, and report-quality failures, with especially poor performance on evidence verification. Together, our taxonomy and findings expose systematic judge limitations, reveal tradeoffs in cost and reliability, and offer actionable guidance for building more reliable evaluation pipelines for deep research agents.
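The abstract's central idea is that a controlled intervention fixes the ground-truth failure label in advance, so a judge can be scored by simple label matching instead of by human-preference agreement. The sketch below illustrates that scoring step only. The `Instance` schema, the category names, and the failure labels are hypothetical and are not taken from the REFLECT release, and the toy data is invented for illustration rather than drawn from the paper's results.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    """One agent trace with a known, injected failure (hypothetical schema)."""
    trace_id: str
    category: str          # e.g. "reasoning", "tool_use", "report_quality"
    injected_failure: str  # ground-truth label fixed by the intervention
    judge_prediction: str  # label returned by the judge model


def accuracy_by_category(instances):
    """Return ({category: accuracy}, overall_accuracy) by exact label match."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for inst in instances:
        total[inst.category] += 1
        if inst.judge_prediction == inst.injected_failure:
            correct[inst.category] += 1
    per_category = {c: correct[c] / total[c] for c in total}
    n = sum(total.values())
    overall = sum(correct.values()) / n if n else 0.0
    return per_category, overall


if __name__ == "__main__":
    # Invented toy data; labels are placeholders, not the paper's taxonomy.
    demo = [
        Instance("t1", "reasoning", "invalid_inference", "invalid_inference"),
        Instance("t2", "reasoning", "invalid_inference", "no_failure"),
        Instance("t3", "tool_use", "wrong_tool_argument", "wrong_tool_argument"),
        Instance("t4", "report_quality", "unsupported_claim", "no_failure"),
    ]
    per_category, overall = accuracy_by_category(demo)
    print(per_category, overall)
```

Reporting accuracy per category as well as overall matters here because a single aggregate number can hide a weak spot, such as the evidence-verification failures the abstract singles out.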
| Subjects: | Computation and Language (cs.CL) |
| Cite as: | arXiv:2605.19196 [cs.CL] |
| | (or arXiv:2605.19196v1 [cs.CL] for this version) |
| | https://doi.org/10.48550/arXiv.2605.19196 (arXiv-issued DOI via DataCite, pending registration) |
Submission history
From: Leyao Wang
[v1] Mon, 18 May 2026 23:55:08 UTC (2,750 KB)
— Originally published at arxiv.org