Can LLM Teams Play What? Where? When?
Quick Take
Team-based strategies significantly enhance the performance of large language models (LLMs) in the quiz game What? Where? When? (ChGK), achieving up to 44.23% accuracy and outperforming single-model baselines by up to 20 percentage points. The study highlights the effectiveness of collaborative reasoning and suggests that LLM teams primarily serve as answer selection and error-filtering mechanisms.
Key Points
- Three team strategies were tested: Voting, Silent Team, and Talkative Team.
- Team-based approaches yielded accuracy gains of up to 20 percentage points.
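- The best team achieved 44.23% accuracy, approaching human team performance on questions with available human statistics.
- Disagreement among models strongly predicts lower accuracy, but sharing rationales substantially mitigates the drop.
- Captains showed no evidence of self-preference bias, and access to peer rationales improved their judgments.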
Article Content
From source RSS / original summary. arXiv:2605.30459v1. Announce Type: new. Abstract: Large language models (LLMs) remain limited on tasks requiring indirect reasoning, cultural knowledge, and coordinated hypothesis testing. We investigate whether team-based interaction improves LLM performance in What? Where? When? (ChGK), a quiz game designed to reward collective reasoning. We introduce three team strategies: Voting, Silent Team (the captain observes final answers), and Talkative Team (the captain observes both answers and rationales).
To minimize data leakage, we evaluate these strategies on a dataset consisting of 572 ChGK questions released in 2025. Using six recent large-scale open models, we show that team-based strategies outperform single-model baselines, yielding gains of up to 20 percentage points in accuracy. The best team achieves 44.23% accuracy and approaches human team performance on questions with available human statistics.
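The three strategies named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `MemberOutput` shape, the `normalize` helper, the tie-breaking rule, and the prompt wording in `captain_prompt` are all assumptions made for illustration. The only detail taken from the abstract is what the captain sees: final answers only (Silent Team) or answers plus rationales (Talkative Team).

```python
from collections import Counter
from typing import List, Tuple

# (answer, rationale) pair produced by one team member; this shape is an assumption.
MemberOutput = Tuple[str, str]


def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace so trivially different strings match."""
    return " ".join(answer.lower().split())


def voting(outputs: List[MemberOutput]) -> str:
    """Voting: return the most frequent normalized answer.

    Ties go to the answer that appeared first, which is how
    Counter.most_common orders elements with equal counts.
    """
    counts = Counter(normalize(answer) for answer, _ in outputs)
    return counts.most_common(1)[0][0]


def captain_prompt(question: str, outputs: List[MemberOutput], talkative: bool) -> str:
    """Build the text a captain model would see (hypothetical wording).

    talkative=False -> Silent Team: final answers only.
    talkative=True  -> Talkative Team: answers plus rationales.
    """
    lines = [f"Question: {question}", "Team answers:"]
    for i, (answer, rationale) in enumerate(outputs, start=1):
        line = f"{i}. {answer}"
        if talkative:
            line += f" (rationale: {rationale})"
        lines.append(line)
    lines.append("As captain, choose the final answer.")
    return "\n".join(lines)
```

In this sketch the captain itself is left out: the returned prompt would be passed to whichever model plays that role, and its reply would be taken as the team's answer.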
Analysis of inter-model diversity reveals that disagreement strongly predicts lower accuracy, but explanatory communication substantially mitigates performance drops. We further examine captain behavior and find no evidence of self-preference bias; access to peer rationales improves captain judgments. Overall, LLM teams function primarily as answer selection and error-filtering mechanisms rather than generators of novel solutions.
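One simple way to probe the link between disagreement and accuracy is to count distinct answers per question and group team accuracy by that count. The sketch below is an assumption about how such an analysis could be set up, not the diversity measure used in the paper; the function names and the record format are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def disagreement(answers: List[str]) -> int:
    """Number of distinct answers after lowercasing and whitespace cleanup."""
    return len({" ".join(a.lower().split()) for a in answers})


def accuracy_by_disagreement(records: List[Tuple[List[str], bool]]) -> Dict[int, float]:
    """Group questions by disagreement level and compute team accuracy per group.

    Each record is (member answers, whether the team's final answer was correct).
    """
    correct: Dict[int, int] = defaultdict(int)
    total: Dict[int, int] = defaultdict(int)
    for answers, team_correct in records:
        level = disagreement(answers)
        total[level] += 1
        correct[level] += int(team_correct)
    return {level: correct[level] / total[level] for level in sorted(total)}
```

Running this separately for Silent Team and Talkative Team records would show whether access to rationales narrows the accuracy gap at higher disagreement levels, which is the pattern the abstract describes.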
Our findings highlight the importance of interaction and suggest adaptive strategies as a promising direction for multi-agent systems.