Shopping Reasoning Bench: An Expert-Authored Benchmark for Multi-Turn Conversational Shopping Assistants
Quick Take
The Shopping Reasoning Bench is a benchmark for evaluating multi-turn conversational shopping assistants, comprising 525 missions and 10,863 expert-authored rubrics. Nine models from the GPT, Claude, and Gemini families reach pass rates of only 57-77% overall, which indicates that they still fall short of expert-level shopping advice.
Key Points
- Shopping Reasoning Bench includes 525 missions: 232 single-turn and 293 multi-turn.
- Expert-authored criteria cover five reasoning categories and fifteen subcategories.
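- Nine models across three families (GPT, Claude, Gemini) reach pass rates of only 57-77% overall.
- On multi-turn missions, all models score 13-29 points lower on optional above-and-beyond criteria than on required ones.
- On multi-turn missions, performance degrades by 4-18 points as conversations progress.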
- Current models provide basic assistance but lack expert-level advice.
Article Content
From source RSS / original summary (arXiv:2606.12608v1)
Abstract: Conversational shopping assistants now serve hundreds of millions of customers, yet no existing benchmark jointly evaluates the open-ended multi-turn reasoning, domain expertise, and criterion-level quality that real shopping conversations demand. Shopping reasoning is unique among language model applications.
Unlike factual question answering or verifiable code generation, it requires balancing subjective preferences, budget constraints, and cross-product trade-offs across multi-turn dialogue, capabilities absent from previous e-commerce and general-purpose benchmarks. We introduce the Shopping Reasoning Bench, an expert-authored benchmark of 525 missions (232 single-turn, 293 multi-turn) with 10,863 importance-weighted binary rubrics authored by retail domain experts.
These criteria are organized under a taxonomy of five reasoning categories and fifteen subcategories covering diverse demands such as preference refinement, trade-off analysis, and compatibility assessment. An evaluation of nine models across three families (GPT, Claude, Gemini) shows that pass rates reach only 57-77% overall. On multi-turn missions, all models score 13-29 points lower on optional above-and-beyond criteria than on required ones, and performance degrades 4-18 points as conversations progress.
These gaps show that current models handle basic shopping assistance but fall short of expert-level advice, making Shopping Reasoning Bench a challenging testbed for future shopping assistant development.
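The abstract describes importance-weighted binary rubrics and a gap between required and optional criteria, but it does not give the scoring formula. The sketch below shows one plausible reading, assuming the pass rate is the weighted fraction of rubrics passed. The `Rubric` fields, the weights, the example rubric texts, and both function names are hypothetical illustrations and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rubric:
    """One binary criterion. All fields are hypothetical, for illustration only."""
    description: str
    weight: float   # assumed importance weight
    required: bool  # True = required, False = optional above-and-beyond
    passed: bool    # binary judgment for one model response


def weighted_pass_rate(rubrics):
    """Weighted fraction of rubrics passed (an assumed scoring rule)."""
    total = sum(r.weight for r in rubrics)
    if total == 0:
        return 0.0
    return sum(r.weight for r in rubrics if r.passed) / total


def required_optional_gap(rubrics):
    """Required minus optional pass rate, in percentage points."""
    required = [r for r in rubrics if r.required]
    optional = [r for r in rubrics if not r.required]
    return 100 * (weighted_pass_rate(required) - weighted_pass_rate(optional))


# Made-up rubrics for a single mission; not data from the benchmark.
example = [
    Rubric("Stays within the stated budget", 3, True, True),
    Rubric("Checks accessory compatibility", 2, True, True),
    Rubric("Explains the weight vs. battery life trade-off", 1, True, False),
    Rubric("Suggests a cheaper alternative unprompted", 2, False, False),
    Rubric("Mentions warranty terms", 2, False, True),
]

print(f"overall pass rate: {weighted_pass_rate(example):.2f}")
print(f"required minus optional: {required_optional_gap(example):.1f} points")
```

Under this assumed rule, a model can pass most required criteria while still scoring noticeably lower on optional ones, which is the kind of gap the abstract reports for multi-turn missions.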