Dialogue SWE-Bench: A Benchmark for Dialogue-Driven Coding Agents
Quick Answer
Dialogue SWE-Bench is a benchmark for evaluating how well coding agents resolve real-world software engineering tasks through dialogue with a user.
Quick Take
Dialogue SWE-Bench evaluates coding agents' dialogue capabilities on real-world software engineering tasks, using a novel persona-grounded user simulator. A schema-guided agent improves over strong baselines by 3-14%, and the results indicate that coding proficiency doesn't equate to dialogue effectiveness.
Key Points
- Dialogue SWE-Bench evaluates coding agents through dialogue with a user rather than as fully autonomous systems.
- A persona-grounded user simulator supports task evaluation, which is augmented with automatic evaluations of dialogue quality.
- The schema-guided agent outperforms strong baselines by 3-14% in dialogue tasks.
- Findings suggest dialogue capability is a distinct and underexplored aspect of coding agents.
Article Excerpt
From source RSS / original summary
arXiv:2606.13995v1
Abstract: AI coding agents have rapidly transformed software engineering, powering widely used interactive coding assistants. Despite their interactive real-world use, existing benchmarks evaluate them as fully-autonomous systems. In this work, we introduce Dialogue SWE-Bench, an automatic benchmark dataset for evaluating the ability of coding agents to resolve real-world software engineering problems through dialogue with a user.
We design a novel, persona-grounded user simulator to support our task evaluation, and augment our task evaluation with automatic evaluations of dialogue quality. We also propose a new schema-guided agent, aimed at improving the dialogue capabilities of off-the-shelf coding agents, which improves over strong baselines by 3-14%.
Our results indicate that better coding models do not always correspond to better dialogue models, suggesting that dialogue capability is a distinct and currently understudied dimension of coding agent performance.