Can LLM Teams Play What? Where? When?

arXiv cs.CL·Anastasia Kotelnikova, Viktor Byzov, Maria Dolzhenkova, Evgeny Kotelnikov

6/1/2026

Quick Take

Team-based strategies significantly enhance the performance of large language models (LLMs) in the quiz game What? Where? When? (ChGK), achieving up to 44.23% accuracy and outperforming single-model baselines by up to 20 percentage points. The study highlights the effectiveness of collaborative reasoning and suggests that LLM teams primarily serve as answer selection and error-filtering mechanisms.

Key Points

Three team strategies were tested: Voting, Silent Team, and Talkative Team.
Team-based approaches yielded accuracy gains of up to 20 percentage points.
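The best team achieved 44.23% accuracy, approaching human team performance on the questions for which human statistics are available.
Disagreement among models strongly predicts lower accuracy, but sharing rationales substantially mitigates the drop.
Captains showed no evidence of self-preference bias, and access to peer rationales improved their judgments.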

Article Content

From source RSS / original summary

arXiv:2605.30459v1. Abstract: Large language models (LLMs) remain limited on tasks requiring indirect reasoning, cultural knowledge, and coordinated hypothesis testing. We investigate whether team-based interaction improves LLM performance in What? Where? When? (ChGK), a quiz game designed to reward collective reasoning. We introduce three team strategies: Voting, Silent Team (the captain observes final answers), and Talkative Team (the captain observes both answers and rationales).
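The three strategies differ only in what reaches the final decision step, which a short sketch can make concrete. The code below is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not describe prompts, answer normalization, or tie-breaking, so `normalize`, `captain_prompt`, the `Captain` callable type, and the first-seen tie rule are all hypothetical choices.

```python
from collections import Counter
from typing import Callable, List, Optional, Tuple

# Hypothetical interface: a captain is any callable that takes a prompt
# string and returns the team's final answer (in practice, an LLM call).
Captain = Callable[[str], str]
Proposal = Tuple[str, Optional[str]]  # (answer, rationale or None)


def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace so trivially different strings match."""
    return " ".join(answer.lower().split())


def voting(answers: List[str]) -> str:
    """Majority vote over normalized answers; ties go to the answer seen first."""
    return Counter(normalize(a) for a in answers).most_common(1)[0][0]


def captain_prompt(question: str, proposals: List[Proposal]) -> str:
    """Assumed prompt layout: the question, then one line per teammate."""
    lines = [f"Question: {question}", "Teammate proposals:"]
    for i, (answer, rationale) in enumerate(proposals, start=1):
        line = f"{i}. {answer}"
        if rationale is not None:
            line += f" (rationale: {rationale})"
        lines.append(line)
    lines.append("Give the team's final answer.")
    return "\n".join(lines)


def silent_team(question: str, proposals: List[Proposal], captain: Captain) -> str:
    """Silent Team: the captain observes final answers only."""
    answers_only = [(answer, None) for answer, _ in proposals]
    return captain(captain_prompt(question, answers_only))


def talkative_team(question: str, proposals: List[Proposal], captain: Captain) -> str:
    """Talkative Team: the captain observes both answers and rationales."""
    return captain(captain_prompt(question, proposals))
```

Voting needs no captain at all, while the two captain-based variants share one code path and differ only in whether rationales are stripped before the prompt is built.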

To minimize data leakage, we evaluate these strategies on a dataset consisting of 572 ChGK questions released in 2025. Using six recent large-scale open models, we show that team-based strategies outperform single-model baselines, yielding gains of up to 20 percentage points in accuracy. The best team achieves 44.23% accuracy and approaches human team performance on questions with available human statistics.
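The gains are reported in percentage points, meaning an absolute difference between two accuracies rather than a relative improvement. The helper functions below are a small sketch of that distinction. One assumption is flagged: the abstract gives only the percentage, so the count of 253 correct answers is inferred from 44.23% of 572 and is not a figure stated by the authors.

```python
def accuracy(correct: int, total: int) -> float:
    """Accuracy as a percentage of questions answered correctly."""
    return 100.0 * correct / total


def percentage_point_gain(team_acc: float, baseline_acc: float) -> float:
    """Absolute difference between two accuracies, in percentage points."""
    return team_acc - baseline_acc


def relative_gain(team_acc: float, baseline_acc: float) -> float:
    """Relative improvement over the baseline, as a percentage of the baseline."""
    return 100.0 * (team_acc - baseline_acc) / baseline_acc


# Inferred, not reported: 253 correct out of 572 rounds to the stated 44.23%.
best_team = round(accuracy(253, 572), 2)
```

A 20 percentage point gain over a weak baseline corresponds to a much larger relative gain, which is why the two measures should not be conflated when comparing strategies.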

Analysis of inter-model diversity reveals that disagreement strongly predicts lower accuracy, but explanatory communication substantially mitigates performance drops. We further examine captain behavior and find no evidence of self-preference bias; access to peer rationales improves captain judgments. Overall, LLM teams function primarily as answer selection and error-filtering mechanisms rather than generators of novel solutions.
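One simple way to relate disagreement to accuracy is to group questions by how many distinct answers the team produced and compute accuracy within each group. The sketch below assumes that approach; the abstract does not say which diversity measure the authors use, so `distinct_answers`, `majority_share`, and the record layout are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple


def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace so trivially different strings match."""
    return " ".join(answer.lower().split())


def distinct_answers(answers: List[str]) -> int:
    """Number of different normalized answers proposed for one question."""
    return len({normalize(a) for a in answers})


def majority_share(answers: List[str]) -> float:
    """Fraction of models that gave the most common answer."""
    counts = Counter(normalize(a) for a in answers)
    return counts.most_common(1)[0][1] / len(answers)


def accuracy_by_disagreement(
    records: List[Tuple[List[str], bool]]
) -> Dict[int, float]:
    """Map each disagreement level to the share of correct final answers.

    Each record is (model answers for one question, whether the team's
    final answer was correct). This layout is an assumption.
    """
    hits: Dict[int, int] = defaultdict(int)
    totals: Dict[int, int] = defaultdict(int)
    for answers, is_correct in records:
        level = distinct_answers(answers)
        totals[level] += 1
        hits[level] += int(is_correct)
    return {level: hits[level] / totals[level] for level in sorted(totals)}
```

Running the same grouping separately for the Silent Team and Talkative Team outputs would show how much of the accuracy drop at high disagreement is recovered when the captain can read rationales.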

Our findings highlight the importance of interaction and suggest adaptive strategies as a promising direction for multi-agent systems.


