Poker Arena: Multi-Axis Profiling of Strategic Reasoning and Memory in LLMs
Quick Answer
Poker Arena is a no-limit Texas Hold'em tournament platform that evaluates LLMs with a nine-axis cognitive profile. Claude Opus 4.6 won +$15,730 in chips yet ranks only fifth of seven on mean axis score.
Quick Take
Tournament chips and mean axis score order the seven models differently, which the authors take as evidence that scalar leaderboards misrank the capability structure of frontier LLMs. In a controlled memory ablation, persistent memory helped some models and hurt others, and consistency across axes mattered more than peak performance on any single axis.
Key Points
- Poker Arena pairs a three-layer memory architecture (within-hand, session, and cross-session) with a nine-axis cognitive profile.
- Claude Opus 4.6 won +$15,730 in chips with 14 first-place finishes but ranked fifth of seven on mean axis score.
- Multi-axis evaluation surfaces capability structure that scalar leaderboards misrank.
- In a controlled memory ablation, persistent memory helped some models and hurt others.
- The study involved seven frontier models across 50 sessions of 1,000 hands.
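To make the chips-versus-axes disagreement concrete, here is a minimal Python sketch of how one field of models can be ordered differently by a scalar outcome and by a mean axis score. Everything in it is an assumption for illustration: the model names, the chip totals, the axis values, and the use of three axes instead of nine are hypothetical and are not results from the paper.

```python
from statistics import mean

# Hypothetical data for illustration only; NOT results from the paper.
# "axes" holds per-axis scores in [0, 1]; three axes are used here for brevity.
results = {
    "model_a": {"chips": 1200, "axes": [0.9, 0.3, 0.4]},   # one strong axis, two weak
    "model_b": {"chips": 300, "axes": [0.7, 0.7, 0.7]},    # consistent across axes
    "model_c": {"chips": -1500, "axes": [0.6, 0.6, 0.5]},
}


def rank_by(results, score):
    """Return model names sorted best-first by the given scoring function."""
    return sorted(results, key=lambda name: score(results[name]), reverse=True)


# Scalar leaderboard: order by chips won.
chip_ranking = rank_by(results, lambda r: r["chips"])

# Multi-axis view: order by the mean of the per-axis scores.
axis_ranking = rank_by(results, lambda r: mean(r["axes"]))

print("by chips:    ", chip_ranking)
print("by mean axis:", axis_ranking)
```

In this toy field the chip leader falls to last place on mean axis score because its profile is uneven, which is the kind of reordering the key points describe.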
Article Excerpt
From source RSS / original summary
arXiv:2606.13815v1
Abstract: Strategic reasoning under uncertainty underpins consequential decisions in negotiation, finance, and policy, but prevailing game-play benchmarks collapse heterogeneous reasoning dimensions into a single scalar, leaving the capability structure of frontier LLMs unexamined. We introduce Poker Arena, a no-limit Texas Hold'em tournament platform that couples a three-layer memory architecture (within-hand, session, and cross-session) with a nine-axis cognitive profile decomposing strategic reasoning into interpretable dimensions such as bet-sizing calibration and positional awareness. We evaluate seven frontier models across 50 sessions of 1,000 hands and a controlled memory ablation; tournament chips and aggregate axis score order the field differently: Claude Opus 4.6 wins +$15,730 chips with 14 first-place finishes, yet ranks only fifth of seven on mean axis score, while persistent memory helps some models and hurts others. These findings show that multi-axis evaluation surfaces capability structure that scalar leaderboards systematically misrank, with cross-dimensional consistency outweighing peak performance on any single axis.