The Confident Liar: Diagnosing Multi-Agent Debate with Log-Probabilities and LLM-as-Judge

arXiv cs.CL·Ali Keramati, Justin Cheok, Jacob Horne, Mark Warschauer

6/10/2026

Quick Answer

This paper studies how three signals relate in multi-agent debate systems: token-level log-probabilities over reasoning tokens, LLM-as-judge rubric scores, and final task accuracy.

Quick Take

In rubric-scoring experiments with a two-agent debate (a Constructor and an Auditor), confidence-based detection of critical reasoning failures is markedly more reliable for the Constructor (AUROC 0.804) than for the Auditor (AUROC 0.634), and confidence aligns with judged reasoning quality roughly twice as strongly for the Constructor. The authors present these results as motivation for the broader cross-domain investigation the paper proposes.

Key Points

Examines three signals: token-level log-probabilities, LLM-as-judge rubric scores, and final task accuracy.
Confidence aligns with judged reasoning quality roughly twice as strongly for the Constructor as for the Auditor.
Confidence-based detection of critical reasoning failures is more reliable for the Constructor (AUROC 0.804) than for the Auditor (AUROC 0.634).
Experiments in the rubric-scoring domain reveal a consistent four-phase confidence trajectory.
The results motivate the proposed investigation across three domains: rubric-based scoring, mathematical reasoning, and factual question answering.

Paper Resources


Article Content

From source RSS / original summary

arXiv:2606.10296v1. Announce Type: new. Abstract: Multi-agent debate systems are typically evaluated only on whether the final answer is correct, overlooking the quality of the intermediate reasoning that debate is designed to produce. This paper studies the relationship between three signals in multi-agent debate: token-level log-probability distributions over reasoning tokens, LLM-as-judge rubric scores assigned to those tokens, and final task accuracy.
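The abstract does not say how token-level log-probabilities are aggregated into a single confidence value. As a minimal sketch, assuming the mean log-probability over a turn's reasoning tokens is used, the following shows one common way to turn per-token values into a scalar. The function names are hypothetical and are not taken from the paper.

```python
import math
from typing import Sequence


def mean_logprob(token_logprobs: Sequence[float]) -> float:
    """Average log-probability over a span of reasoning tokens."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return sum(token_logprobs) / len(token_logprobs)


def confidence(token_logprobs: Sequence[float]) -> float:
    """Geometric-mean token probability, exp(mean log-prob), in (0, 1].

    Assumption: this aggregation is illustrative only; the paper may use
    a different statistic of the log-probability distribution.
    """
    return math.exp(mean_logprob(token_logprobs))


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Inverse of the geometric-mean probability: exp(-mean log-prob)."""
    return math.exp(-mean_logprob(token_logprobs))


if __name__ == "__main__":
    # Two tokens, each generated with probability 0.5.
    turn = [math.log(0.5), math.log(0.5)]
    print(confidence(turn), perplexity(turn))
```

Averaging rather than summing keeps long and short reasoning turns comparable, which matters when the two agents produce turns of different lengths.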

We examine whether internal confidence signals predict externally evaluated reasoning quality, and whether either signal aligns with task correctness, across three domains: rubric-based scoring, mathematical reasoning, and factual question answering. Our framework pairs a two-agent debate architecture -- a Constructor and an Auditor -- with an LLM-as-judge that scores each agent's reasoning along instruction following, justification quality, and evidence grounding, together with a critical-failure flag.
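The judge output described above can be pictured as one small record per agent turn. The sketch below is an assumption about shape only: the field names, the numeric scale, and the unweighted average in `overall` are hypothetical, since the abstract names the three rubric dimensions and the critical-failure flag but not how they are encoded or combined.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JudgeScore:
    """One LLM-as-judge verdict on a single agent turn (hypothetical schema)."""

    role: str                      # "constructor" or "auditor"
    instruction_following: float   # rubric dimension 1
    justification_quality: float   # rubric dimension 2
    evidence_grounding: float      # rubric dimension 3
    critical_failure: bool         # flag raised by the judge

    def overall(self) -> float:
        """Unweighted mean of the three rubric dimensions (an assumption)."""
        return (
            self.instruction_following
            + self.justification_quality
            + self.evidence_grounding
        ) / 3.0


if __name__ == "__main__":
    verdict = JudgeScore("constructor", 4.0, 5.0, 3.0, critical_failure=False)
    print(verdict.overall())
```

Keeping the critical-failure flag separate from the averaged score lets it serve as a binary label for detection experiments, independent of the graded dimensions.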

Experiments in the rubric-scoring domain reveal a consistent four-phase confidence trajectory and a substantial role asymmetry: confidence aligns with judged reasoning quality roughly twice as strongly for the Constructor as for the Auditor, and confidence-based detection of critical reasoning failures is markedly more reliable for the Constructor (AUROC 0.804) than for the Auditor (0.634). These findings motivate the broader cross-domain investigation proposed in this paper.
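To make the AUROC figures concrete: AUROC is the probability that a randomly chosen failing turn is ranked as more suspicious than a randomly chosen non-failing turn, with ties counted as half. The sketch below computes it directly from that definition, assuming that lower confidence should signal a critical failure. The function names and the toy data are hypothetical and do not reproduce the paper's experiments.

```python
from typing import Sequence


def auroc(scores: Sequence[float], labels: Sequence[bool]) -> float:
    """Pairwise AUROC: P(score of positive > score of negative), ties = 0.5."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def failure_detection_auroc(
    confidences: Sequence[float], critical_failures: Sequence[bool]
) -> float:
    """Score each turn by negated confidence so low confidence ranks highest."""
    return auroc([-c for c in confidences], critical_failures)


if __name__ == "__main__":
    # Toy data (invented for illustration): four turns per setting.
    separable = failure_detection_auroc(
        [0.9, 0.8, 0.3, 0.2], [False, False, True, True]
    )
    mixed = failure_detection_auroc(
        [0.9, 0.5, 0.6, 0.2], [False, True, True, False]
    )
    print(separable, mixed)
```

A value of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation, which is the scale on which 0.804 and 0.634 should be read. The quadratic pairwise loop is fine for small samples; for larger ones, a library routine such as scikit-learn's `roc_auc_score` is the usual choice.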

