The Confident Liar: Diagnosing Multi-Agent Debate with Log-Probabilities and LLM-as-Judge
Quick Answer
This paper investigates multi-agent debate systems, focusing on the correlation between token-level log-probabilities, LLM-as-judge scores, and final task accuracy.
Quick Take
In the rubric-scoring domain, confidence aligns with judged reasoning quality roughly twice as strongly for the Constructor agent as for the Auditor. Confidence-based detection of critical reasoning failures is also more reliable for the Constructor (AUROC 0.804) than for the Auditor (AUROC 0.634). The authors use these results to motivate a broader cross-domain investigation.
Key Points
- Examines three signals: log-probabilities, LLM-as-judge scores, and task accuracy.
- Confidence aligns with judged reasoning quality roughly twice as strongly for the Constructor as for the Auditor.
- Confidence-based detection of critical reasoning failures is more reliable for the Constructor (AUROC 0.804) than for the Auditor (AUROC 0.634).
- The findings motivate a broader investigation across three domains: rubric-based scoring, mathematical reasoning, and factual question answering.
- Experiments in the rubric-scoring domain reveal a consistent four-phase confidence trajectory.
Article Content
From source RSS / original summary (arXiv:2606.10296v1). Abstract: Multi-agent debate systems are typically evaluated only on whether the final answer is correct, overlooking the quality of the intermediate reasoning that debate is designed to produce. This paper studies the relationship between three signals in multi-agent debate: token-level log-probability distributions over reasoning tokens, LLM-as-judge rubric scores assigned to those tokens, and final task accuracy.
We examine whether internal confidence signals predict externally evaluated reasoning quality, and whether either signal aligns with task correctness, across three domains: rubric-based scoring, mathematical reasoning, and factual question answering. Our framework pairs a two-agent debate architecture -- a Constructor and an Auditor -- with an LLM-as-judge that scores each agent's reasoning along instruction following, justification quality, and evidence grounding, together with a critical-failure flag.
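As a rough illustration of the signals described above, the sketch below bundles one agent turn with its token log-probabilities and its judge scores. The abstract does not say how log-probabilities are aggregated or what scale the rubric uses, so the mean log-probability, the unweighted rubric average, and every name here (`JudgedTurn`, `confidence`, `judge_score`) are assumptions made for illustration only.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class JudgedTurn:
    """One agent turn in a debate, with both signals attached (hypothetical layout)."""

    role: str                      # "constructor" or "auditor"
    token_logprobs: list[float]    # log-probability of each reasoning token
    instruction_following: float   # judge rubric dimension (scale assumed)
    justification_quality: float   # judge rubric dimension (scale assumed)
    evidence_grounding: float      # judge rubric dimension (scale assumed)
    critical_failure: bool         # judge's critical-failure flag

    def confidence(self) -> float:
        # Assumed aggregate: mean token log-probability (closer to 0 = more confident).
        return mean(self.token_logprobs)

    def judge_score(self) -> float:
        # Assumed aggregate: unweighted mean of the three rubric dimensions.
        return mean(
            [
                self.instruction_following,
                self.justification_quality,
                self.evidence_grounding,
            ]
        )
```

With records like these, one value per turn is available for each signal, which is the shape needed to compare confidence against judged quality separately for each role.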
Experiments in the rubric-scoring domain reveal a consistent four-phase confidence trajectory and a substantial role asymmetry: confidence aligns with judged reasoning quality roughly twice as strongly for the Constructor as for the Auditor, and confidence-based detection of critical reasoning failures is markedly more reliable for the Constructor (AUROC 0.804) than for the Auditor (0.634). These findings motivate the broader cross-domain investigation proposed in this paper.
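To make the AUROC figures concrete: AUROC is the probability that a randomly chosen failing turn receives a higher detector score than a randomly chosen non-failing turn, with ties counted as half. The sketch below computes it directly from that definition. It assumes the detector score is the negated confidence (lower confidence suggests failure), which the abstract does not confirm; the function name `auroc` and the toy numbers are illustrative, not data from the paper.

```python
def auroc(scores: list[float], labels: list[bool]) -> float:
    """Pairwise AUROC: P(score of a positive > score of a negative), ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy, made-up values: mean log-probability per turn and the judge's failure flag.
confidences = [-0.2, -1.5, -0.4, -2.0]
critical_failures = [False, True, False, True]

# Assumed detector: lower confidence means higher failure score.
failure_scores = [-c for c in confidences]
toy_auroc = auroc(failure_scores, critical_failures)
```

An AUROC of 0.5 corresponds to chance-level detection and 1.0 to perfect separation, so on this reading the reported 0.634 for the Auditor sits much closer to chance than the 0.804 for the Constructor.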