Conv-to-Bench: Evaluating Language Models Via User-Assistant Dialogues In Code Tasks
Quick Take
Conv-to-Bench is a scalable framework that turns authentic multi-turn user-assistant dialogues into structured evaluation checklists for LLMs. In the programming domain, its evaluation sets reach Spearman correlations of up to 1.000 with human-authored benchmarks such as BigCodeBench, at significantly lower computational overhead. The primary LLM-as-a-judge evaluator shows substantial agreement with human-verified ground truth (κ = 0.705).
Key Points
- Conv-to-Bench transforms user-assistant dialogues into structured evaluation checklists.
- Reaches Spearman correlations of up to 1.000 with BigCodeBench, indicating near-perfect alignment with a human-authored benchmark.
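- The primary LLM-as-a-judge evaluator reaches κ = 0.705 agreement with human-verified ground truth.
- Ablations show that multi-turn interactions capture how user intent evolves, while instruction-centric extraction provides the more robust foundation.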
- Provides a cost-effective solution for maintaining evaluation standards in AI.
Article Content
From source RSS / original summary: arXiv:2605.26440v1
Abstract: The rapid advancement of Large Language Models (LLMs) has outpaced the scalability of traditional evaluation benchmarks, which remain heavily dependent on labor-intensive expert curation. We address this bottleneck with Conv-to-Bench, a multi-stage framework that automatically transforms authentic multi-turn user-assistant dialogues into structured, verifiable requirement checklists.
By leveraging the "instructional evolution" found in real-world conversational logs, our approach deconstructs fragmented user intent into consolidated instructions and binary evaluation criteria. Applied to the programming domain, Conv-to-Bench produces evaluation sets that demonstrate near-perfect alignment with human-authored standards like BigCodeBench, achieving Spearman correlations of up to $\rho = 1.000$ with significantly lower computational overhead.
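To make the alignment claim concrete, the sketch below shows one way a checklist of binary criteria could be turned into per-model scores and then compared with a reference benchmark using Spearman's rank correlation. It is an illustration under stated assumptions, not the paper's implementation: the `Checklist` class, the model names, the criteria, the verdicts, and the reference scores are all hypothetical, and the abstract does not say how scores are aggregated (a simple pass rate is assumed here).

```python
from dataclasses import dataclass


@dataclass
class Checklist:
    """Hypothetical container: one consolidated instruction and its binary criteria."""
    instruction: str
    criteria: list[str]


def pass_rate(verdicts: list[bool]) -> float:
    """Fraction of binary criteria that a response satisfied (assumed aggregation)."""
    return sum(verdicts) / len(verdicts) if verdicts else 0.0


def average_ranks(values: list[float]) -> list[float]:
    """Rank values from 1..n in ascending order; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x: list[float], y: list[float]) -> float:
    """Spearman's rho, computed as the Pearson correlation of the rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equally long lists with at least two items")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rho is undefined when one list is constant")
    return cov / (var_x * var_y) ** 0.5


# Hypothetical checklist and judge verdicts, for illustration only.
checklist = Checklist(
    instruction="Write a function that parses a CSV file into a list of dicts.",
    criteria=["handles header row", "handles empty file",
              "uses the csv module", "includes type hints"],
)
verdicts = {
    "model_a": [True, True, True, False],
    "model_b": [True, False, True, False],
    "model_c": [True, False, False, False],
}
# Hypothetical scores on a human-authored reference benchmark.
reference_scores = {"model_a": 0.61, "model_b": 0.48, "model_c": 0.35}

models = sorted(verdicts)
checklist_scores = [pass_rate(verdicts[m]) for m in models]
reference = [reference_scores[m] for m in models]
rho = spearman(checklist_scores, reference)
print(f"Spearman rho = {rho:.3f}")
```

Because Spearman's coefficient depends only on rank order, a value of 1.000 means the two evaluations order the models identically; it does not mean the absolute scores match. With only a handful of models, a perfect rank match is also easier to reach than with many, which is worth keeping in mind when reading "up to" figures.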
Validation of the LLM-as-a-judge framework further confirms its reliability, with the primary evaluator achieving substantial agreement with human-verified ground truth ($\kappa = 0.705$). Our comprehensive ablation studies reveal that while multi-turn interactions capture the iterative evolution of user intent, instruction-centric extraction provides a more robust foundation.
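The agreement figure can be read through a small worked example. The sketch below computes Cohen's kappa between an LLM judge's binary verdicts and human labels. The abstract reports only $\kappa$ without naming the variant, so Cohen's kappa is an assumption here, and the verdict lists are hypothetical data, not results from the paper.

```python
from collections import Counter


def cohen_kappa(labels_a: list[bool], labels_b: list[bool]) -> float:
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: share of items on which both raters gave the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in set(count_a) | set(count_b)) / n ** 2
    if expected == 1.0:
        raise ValueError("kappa is undefined when both raters use a single label")
    return (observed - expected) / (1 - expected)


# Hypothetical pass/fail verdicts on ten checklist criteria.
judge = [True, True, True, False, False, True, False, True, True, False]
human = [True, True, False, False, False, True, False, True, True, True]

kappa = cohen_kappa(judge, human)
print(f"kappa = {kappa:.3f}")  # 8/10 raw agreement, 0.52 chance agreement -> 0.583
```

Kappa is lower than raw agreement because it discounts the matches two raters would produce by chance given how often each says "pass". Under the widely used Landis and Koch convention, values between 0.61 and 0.80 are labeled "substantial", which is consistent with how the abstract describes 0.705.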
Ultimately, Conv-to-Bench provides a scalable, cost-effective paradigm for maintaining high-fidelity evaluation standards as user-centric AI applications continue to diversify.