Debate Helps Weak Judges Reward Stronger Models
Quick Take
In proposer-critic debate, a critic whose classification ability exceeds the judge's can significantly improve judge performance on code and logic tasks relative to a consultancy baseline. When critic and judge have similar classification ability, debate yields null effects, so the critic's advantage over the judge should be checked before deployment.
Key Points
- Debate improves judge performance when critics outperform judges in classification tasks.
- Statistically significant gains observed in three out of five model pairings.
- Null effects occur when critic and judge classification abilities are within noise of each other; in those pairings, judge verification rates also drop once a critic enters the transcript.
- Removing rebuttal rounds produces no measurable change in judge performance.
- Suggests a cost-effective oversight method for verifiable domains.
Article Content
From source RSS / original summary: arXiv:2605.27483v1. Announce Type: new. Abstract: Despite theoretical promise, debate as a scalable oversight protocol has produced mixed empirical results: gains in some settings, and null effects in others, especially when the judge does not have information hidden from it. We study proposer-critic debate in a stronger-debater/weaker-judge setting on programmatically verifiable code and logic tasks.
Debate helps the judge over a consultancy baseline when the critic provides a usable advantage: the critic's classification ability must exceed the judge's, and the judge must treat critic speeches as claims to verify rather than testimony to summarize. On the three of five pairings where the condition holds, proposer-critic debate's gains are statistically significant over consultancy, and these pairings are the most capable model pairings.
On the two non-responder pairings in our set, debate produces null effects, and judge verification rates drop by tens of percentage points once a critic enters the transcript. In these cases the critic's binary-classification ability and the judge's are within noise of each other, and the critic's disagreement is parsed as testimony rather than a claim to check.
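The "within noise" comparison above can be made concrete with a small check. The following is a minimal sketch, assuming a paired bootstrap over per-item correctness flags on a shared labeled set; the abstract does not say which statistical test the authors use, and the names `accuracy_gap_ci` and `critic_has_usable_advantage` are hypothetical.

```python
import random


def accuracy_gap_ci(critic_correct, judge_correct, n_boot=2000, alpha=0.05, seed=0):
    """Paired bootstrap CI for (critic accuracy - judge accuracy).

    Both inputs are per-item correctness flags on the SAME labeled items.
    This resampling scheme is an illustrative assumption, not the paper's method.
    """
    if len(critic_correct) != len(judge_correct) or not critic_correct:
        raise ValueError("inputs must be non-empty and the same length")
    n = len(critic_correct)
    rng = random.Random(seed)
    # Per-item difference: +1 critic right / judge wrong, -1 the reverse, 0 tie.
    diffs = [int(c) - int(j) for c, j in zip(critic_correct, judge_correct)]
    gaps = []
    for _ in range(n_boot):
        resample = [diffs[rng.randrange(n)] for _ in range(n)]
        gaps.append(sum(resample) / n)
    gaps.sort()
    lo = gaps[int((alpha / 2) * n_boot)]
    hi = gaps[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return sum(diffs) / n, lo, hi


def critic_has_usable_advantage(critic_correct, judge_correct):
    """True only if the lower CI bound on the accuracy gap is above zero."""
    _, lo, _ = accuracy_gap_ci(critic_correct, judge_correct)
    return lo > 0
```

A pairing where the interval straddles zero would correspond to the "within noise" case described above; the 95% level and the number of resamples are arbitrary defaults chosen for illustration.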
Ablating rebuttal rounds from debate produces no measurable change in judge performance: a single independent critique recovers the bulk of debate's benefit at lower inference cost. These findings suggest a cheaper primitive for training-free scalable oversight in verifiable domains (answer, critique, judge) and a pre-deployment audit (does the critic beat the judge, and will the judge verify it?) that predicts when debate will help.
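The answer, critique, judge primitive described above can be sketched as a single pass with no rebuttal rounds. This is an illustrative outline only: the three roles are plain callables standing in for model calls, and `Transcript`, `answer_critique_judge`, and the toy stubs are hypothetical names, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class Transcript:
    task: str
    answer: str
    critique: str


def answer_critique_judge(
    task: str,
    proposer: Callable[[str], str],
    critic: Callable[[str, str], str],
    judge: Callable[[Transcript], bool],
) -> Tuple[Transcript, bool]:
    """One proposer answer, one independent critique, one judge verdict."""
    answer = proposer(task)
    critique = critic(task, answer)  # single critique, no rebuttal rounds
    transcript = Transcript(task=task, answer=answer, critique=critique)
    verdict = judge(transcript)  # True means the judge accepts the answer
    return transcript, verdict


# Toy stand-ins (assumptions for illustration; real roles would be model calls).
def toy_proposer(task: str) -> str:
    return "5"  # deliberately wrong answer to "2 + 2"


def toy_critic(task: str, answer: str) -> str:
    return "Claim: the answer should be 4; recompute 2 + 2 to check."


def toy_verifying_judge(t: Transcript) -> bool:
    # Treats the critique as a claim to check: recomputes instead of summarizing.
    return t.answer == str(2 + 2)


transcript, accepted = answer_critique_judge(
    "2 + 2", toy_proposer, toy_critic, toy_verifying_judge
)
```

In this sketch the judge stub verifies the answer directly, which mirrors the condition that critic speeches be treated as claims to verify rather than testimony to summarize.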