Poker Arena: Multi-Axis Profiling of Strategic Reasoning and Memory in LLMs

arXiv cs.AI·Pratham Singla, Shivank Garg, Vihan Singh

6/15/2026

Quick Take

Poker Arena is a no-limit Texas Hold'em tournament platform that evaluates LLMs with a nine-axis cognitive profile rather than a single score. Claude Opus 4.6 wins +$15,730 in chips yet ranks only fifth of seven on mean axis score, which suggests that scalar leaderboards misrank model capabilities. The memory ablation also shows that persistent memory helps some models and hurts others.

Key Points

Poker Arena evaluates LLMs using a three-layer memory architecture (within-hand, session, and cross-session) and a nine-axis cognitive profile.
Claude Opus 4.6 won +$15,730 in chips with 14 first-place finishes but ranked only fifth of seven on mean axis score.
Multi-axis evaluation surfaces capability structure that scalar leaderboards systematically misrank.
Persistent memory helps some models and hurts others.
The study evaluated seven frontier models across 50 sessions of 1,000 hands, plus a controlled memory ablation.
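
The three memory layers named above (within-hand, session, and cross-session) can be pictured as stores with different lifetimes. The following is only a minimal sketch under stated assumptions: the class `LayeredMemory`, its method names, and the way entries roll up from one layer to the next are all hypothetical, since the summary does not describe how the platform implements memory.

```python
from dataclasses import dataclass, field


@dataclass
class LayeredMemory:
    """Hypothetical three-layer store; not the paper's implementation."""
    within_hand: list = field(default_factory=list)    # cleared after each hand
    session: list = field(default_factory=list)        # cleared after each session
    cross_session: list = field(default_factory=list)  # persists across sessions

    def observe(self, event: str) -> None:
        # Record an event seen during the current hand.
        self.within_hand.append(event)

    def end_hand(self) -> None:
        # Assumption: a hand's events roll up into one session-level entry.
        if self.within_hand:
            self.session.append(" | ".join(self.within_hand))
        self.within_hand.clear()

    def end_session(self) -> None:
        # Assumption: a session is reduced to a short persistent note.
        self.end_hand()
        if self.session:
            self.cross_session.append(f"{len(self.session)} hands recorded")
        self.session.clear()
```

In this sketch, a memory ablation would amount to disabling one layer (for example, never writing to `cross_session`) and comparing results with and without it.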

Article Excerpt

From source RSS / original summary

arXiv:2606.13815v1 Announce Type: new

Abstract: Strategic reasoning under uncertainty underpins consequential decisions in negotiation, finance, and policy, but prevailing game-play benchmarks collapse heterogeneous reasoning dimensions into a single scalar, leaving the capability structure of frontier LLMs unexamined.

We introduce Poker Arena, a no-limit Texas Hold'em tournament platform that couples a three-layer memory architecture (within-hand, session, and cross-session) with a nine-axis cognitive profile decomposing strategic reasoning into interpretable dimensions such as bet-sizing calibration and positional awareness. We evaluate seven frontier models across 50 sessions of 1,000 hands and a controlled memory ablation; tournament chips and aggregate axis score order the field differently: Claude Opus 4.6 wins +$15,730 chips with 14 first-place finishes, yet ranks only fifth of seven on mean axis score, while persistent memory helps some models and hurts others. These findings show that multi-axis evaluation surfaces capability structure that scalar leaderboards systematically misrank, with cross-dimensional consistency outweighing peak performance on any single axis.
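
To make concrete how chip totals and mean axis score can order the same field differently, here is a small sketch. All numbers and model names are invented for illustration and are not results from the paper; only two of the nine axes are named in the abstract, so the sketch uses just those two, and the helper `rank` is hypothetical.

```python
from statistics import mean

# Hypothetical data for illustration only; not results from the paper.
chips = {"model_a": 1200, "model_b": -300, "model_c": 450}
axis_scores = {
    "model_a": {"bet_sizing_calibration": 0.40, "positional_awareness": 0.50},
    "model_b": {"bet_sizing_calibration": 0.80, "positional_awareness": 0.70},
    "model_c": {"bet_sizing_calibration": 0.60, "positional_awareness": 0.55},
}


def rank(scores):
    """Return names ordered best-first by their score."""
    return sorted(scores, key=scores.get, reverse=True)


# Scalar leaderboard: order by chips won.
chip_rank = rank(chips)

# Multi-axis view: order by the mean over each model's axis scores.
axis_rank = rank({m: mean(s.values()) for m, s in axis_scores.items()})

print("by chips:    ", chip_rank)
print("by mean axis:", axis_rank)
```

With this made-up data the chip leader places last on mean axis score, the same kind of divergence the abstract reports for Claude Opus 4.6.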
