When Helping Hurts and How to Fix It: Multi-Agent Debate for Data Cleaning
Quick Take
Multi-agent debate improves error detection in data cleaning (+27.4pp F1) but degrades generation (-1.6 to -15.5pp) through critique-induced confusion. The one configuration that beats a single agent on a generative task (+5.3pp) relies on adversarial separation: a separate Critic with code-execution grounding and evidence-gated generation.
Key Points
- Debate's effect reverses sign: it degrades generation while improving error detection.
- Critique-induced confusion (hallucinated Critic feedback that the Generator accepts uncritically) degrades generation across all four model families.
- A factorial experiment shows adversarial separation is essential: self-verification with identical tools fails.
- The new configuration exceeds the single-agent baseline on a generative task by 5.3pp (p<0.05).
- Condition for success: the probability of rescuing a wrong output must exceed the probability of destroying a correct one.
Article Excerpt
From source RSS / original summary: arXiv:2606.02866v1. Announce Type: new.
Abstract: When does multi-agent debate help data cleaning, and when does it hurt? Across three benchmarks, four model families, and over 6,000 task-condition pairs, we find debate's effect reverses sign: it degrades generation across all four models (-1.6 to -15.5pp) through critique-induced confusion (CIC), hallucinated Critic feedback that the Generator accepts uncritically, yet improves error detection (+27.4pp F1, d=1.0).
We derive a debate benefit condition: debate helps when the probability of rescuing a wrong output (Critic verification odds weighted by fixability) exceeds the probability of destroying a correct one. A factorial experiment proves adversarial separation is essential: self-verification with identical tools fails, while a separate Critic with code-execution grounding and evidence-gated generation produces the first debate configuration to significantly exceed single-agent on a generative task (+5.3pp, p<0.05).
The condition correctly predicts all nine task types and generalizes with zero false positives across 19 published comparisons in seven domains.
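The debate benefit condition can be illustrated with a little expected-accuracy bookkeeping. The sketch below is a simplification, not the paper's formula: the abstract does not give the exact expression (in particular, how Critic verification odds and fixability combine into the rescue probability), so `p_rescue` and `p_destroy` are treated as given inputs. The function names and the example numbers are hypothetical and chosen only to show the sign flip.

```python
def debate_net_gain(base_accuracy: float, p_rescue: float, p_destroy: float) -> float:
    """Expected change in accuracy from adding a debate round.

    Assumed simplified model (not taken from the paper):
      base_accuracy: probability the single agent is already correct
      p_rescue:      probability debate fixes an output that was wrong
      p_destroy:     probability debate breaks an output that was correct
    """
    for value in (base_accuracy, p_rescue, p_destroy):
        if not 0.0 <= value <= 1.0:
            raise ValueError("all inputs must be probabilities in [0, 1]")
    rescued = (1.0 - base_accuracy) * p_rescue   # wrong outputs turned correct
    destroyed = base_accuracy * p_destroy        # correct outputs turned wrong
    return rescued - destroyed


def debate_helps(base_accuracy: float, p_rescue: float, p_destroy: float) -> bool:
    """True when the rescued mass exceeds the destroyed mass."""
    return debate_net_gain(base_accuracy, p_rescue, p_destroy) > 0.0


# Made-up numbers: the same Critic behaviour helps a weak Generator
# but hurts a strong one, because there is less left to rescue.
weak = debate_net_gain(base_accuracy=0.5, p_rescue=0.3, p_destroy=0.1)
strong = debate_net_gain(base_accuracy=0.8, p_rescue=0.3, p_destroy=0.1)
print(f"weak generator:   {weak:+.2f}")
print(f"strong generator: {strong:+.2f}")
```

Under this model the same rescue and destroy rates can produce opposite outcomes depending on how often the Generator is already right, which is one way to read the sign reversal between generation and error detection.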