Debate Helps Weak Judges Reward Stronger Models

arXiv cs.CL·Ethan Elasky, Frank Nakasako, Naman Goyal

5/28/2026

·~2 min·5/28/2026·en·1

Quick Answer

This paper shows that In proposer-critic debate settings, stronger critics can significantly enhance judge performance on code and logic tasks, outperforming consultancy methods.

Quick Take

In proposer-critic debate settings, stronger critics can significantly enhance judge performance on code and logic tasks, outperforming consultancy methods. However, when critics and judges have similar classification abilities, debate yields null effects, suggesting a need for careful evaluation of critic capabilities before implementation.

Key Points

Debate improves judge performance when critics outperform judges in classification tasks.
Statistically significant gains observed in three out of five model pairings.
Null effects occur when critic and judge abilities are similar, reducing verification rates.
Removing rebuttal rounds does not impact judge performance significantly.
Suggests a cost-effective oversight method for verifiable domains.

Paper Resources

Read Paperarxiv.org View PDFarxiv.org

Article Content

From source RSS / original summary

arXiv:2605. 27483v1 Announce Type: new Abstract: Despite theoretical promise, debate as a scalable oversight protocol has produced mixed empirical results: gains in some settings, and null effects in others, especially when the judge does not have information hidden from it. We study proposer-critic debate in a stronger-debater/weaker-judge setting on programmatically verifiable code and logic tasks.

Debate helps the judge over a consultancy baseline when the critic provides a usable advantage: the critic's classification ability must exceed the judge's, and the judge must treat critic speeches as claims to verify rather than testimony to summarize. On the three of five pairings where the condition holds, proposer-critic debate's gains are statistically significant over consultancy, and these pairings are the most capable model pairings.

On the two non-responder pairings in our set, debate produces null effects, and judge verification rates drop by tens of percentage points once a critic enters the transcript. In these cases the critic's binary-classification ability and the judge's are within noise of each other, and the critic's disagreement is parsed as testimony rather than a claim to check.

Ablating rebuttal rounds from debate produces no measurable change in judge performance: a single independent critique recovers the bulk of debate's benefit at lower inference cost. These findings suggest a cheaper primitive for training-free scalable oversight in verifiable domains (answer, critique, judge) and a pre-deployment audit (does the critic beat the judge, and will the judge verify it? ) that predicts when debate will help.

Read on arxiv.org

Want this in your inbox every morning?

Daily brief at your local 8am — bilingual EN/中文, free.

Subscribe — it's free

More from arXiv cs.CL

See more →

arXiv cs.CL·Miguel Arana-Catania, Catherine Conisbee, Matthew Kidd

1d ago

FeaturedOriginal

Letting the Data Speak: Extracting Keywords from Crowdsourced Collections with AI

AI Summary

The study evaluates three NLP approaches—Named Entity Recognition, Keyword Extraction, and Topic Modelling—using the Their Finest Hour Online Archive to automate keyword extraction from crowdsourced WWII collections. Findings suggest that while NLP methods show promise, no single approach is sufficient, and ethical considerations in automated keyword extraction are crucial for responsible stewardship.

#AI Coding #Inference #Open Source #Policy

Debate Helps Weak Judges Reward Stronger Models

Quick Answer

Quick Take

Key Points

Paper Resources

Article Content

Want this in your inbox every morning?

More from arXiv cs.CL

Letting the Data Speak: Extracting Keywords from Crowdsourced Collections with AI

Quantifying Prior Dominance in Systems

Time to REFLECT: Can We Trust LLM Judges for Evidence-based Research Agents?

Quick Answer

Quick Take

Key Points

Paper Resources

Article Content

Want this in your inbox every morning?

More from arXiv cs.CL

Letting the Data Speak: Extracting Keywords from Crowdsourced Collections with AI

Quantifying Prior Dominance in RAG Systems

Time to REFLECT: Can We Trust LLM Judges for Evidence-based Research Agents?

Quantifying Prior Dominance in Systems