Time to REFLECT: Can We Trust LLM Judges for Evidence-based Research Agents?

arXiv cs.CL·Leyao Wang, Yanan He, Peng Chen, Asaf Yehudai, Yixin Liu, Rex Ying, Michal Shmueli-Scheuer, Arman Cohan

5/20/2026

·~2 min·5/20/2026·en·2

Quick Answer

The REFLECT benchmark reveals that current LLM judges are unreliable, achieving below 55% accuracy in evaluating reasoning and evidence use, highlighting the need for improved evaluation methods for deep research agents.

Quick Take

Key Points

REFLECT introduces a taxonomy for evaluating LLM judges in research contexts.
Current LLM judges show poor performance, especially in evidence verification.
Existing meta-evaluations rely on subjective human preferences and overlook open-ended tasks.
The benchmark aims to enhance reliability in evaluating deep research agents.
Findings expose systematic limitations in LLM judges, impacting research accuracy.

Paper Resources

Read Paperarxiv.org View PDFarxiv.org

📖 Reader Mode

~2 min read

[Submitted on 18 May 2026]

View PDF HTML (experimental)

Abstract:Deep research agents increasingly automate complex information-seeking tasks, producing evidence-grounded reports via multi-step reasoning, tool use, and synthesis. Their growing role demands scalable, reliable evaluation, positioning LLM-as-judge as a supervision paradigm for assessing factual accuracy, evidence use, and reasoning quality. Yet the reliability of these judges for deep research agents remains poorly understood, posing a critical meta-evaluation problem: before deploying LLM judges to supervise research agents, we must first evaluate the judges themselves. Existing meta-evaluations fall short in two ways: (1) reliance on coarse, subjective human-preference agreement; (2) focus on instruction-following or verifiable tasks, leaving open-ended agent executions unexplored. To address these gaps, we introduce REFLECT (REliable Fine-grained LLM judge Evaluation via Controlled inTervention), a meta-evaluation benchmark targeting fine-grained failure detection in agentic environments. REFLECT defines a detailed taxonomy of process- and outcome-level failure modes, instantiated by performing controlled and localized interventions on quality-screened agent execution traces. This yields verifiable, comprehensive, and fine-grained instances for validating the judge models. Our experiments show that current LLM judges remain unreliable: even the best-performing models achieve overall accuracies below 55% across reasoning, tool-use, and report-quality failures, with especially poor performance on evidence verification. Together, our taxonomy and findings expose systematic judge limitations, reveal tradeoffs in cost and reliability, and offer actionable guidance for building more reliable evaluation pipelines for deep research agents.

Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2605.19196 [cs.CL]
	(or arXiv:2605.19196v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2605.19196 arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Leyao Wang [view email]
[v1] Mon, 18 May 2026 23:55:08 UTC (2,750 KB)

— Originally published at arxiv.org

Continue reading on arxiv.org

Want this in your inbox every morning?

Daily brief at your local 8am — bilingual EN/中文, free.

Subscribe — it's free

More from arXiv cs.CL

See more →

arXiv cs.CL·Barak Or

1w ago

FeaturedOriginal

Quantifying Prior Dominance in Systems

AI Summary

The study introduces the Normalized Context Utilization (NCU) metric to evaluate Retrieval-Augmented Generation (RAG) systems, revealing that Small Language Models (SLMs) outperform larger models in factual extraction. The findings indicate that traditional scaling laws yield diminishing returns, with a commercial API frequently failing against adversarial evidence due to systemic confidence collapse.

#LLM #AI Coding #Inference #AI Startup

Time to REFLECT: Can We Trust LLM Judges for Evidence-based Research Agents?

Quick Answer

Quick Take

Key Points

Paper Resources

📖 Reader Mode

Submission history

Want this in your inbox every morning?

More from arXiv cs.CL

Quantifying Prior Dominance in Systems

When Plausible Is Not Realistic: Evaluating Human Mobility in LLM-Based Urban Simulation

Safeguarding LLM Agents from Misalignment through Provenance Analysis

Quick Answer

Quick Take

Key Points

Paper Resources

📖 Reader Mode

Submission history

Want this in your inbox every morning?

More from arXiv cs.CL

Quantifying Prior Dominance in RAG Systems

When Plausible Is Not Realistic: Evaluating Human Mobility in LLM-Based Urban Simulation

Safeguarding LLM Agents from Misalignment through Provenance Analysis

Quantifying Prior Dominance in Systems