Strategic reasoning under uncertainty underpins consequential decisions in negotiation, finance, and policy, but prevailing game-play benchmarks collapse heterogeneous reasoning dimensions into a single scalar, leaving the capability structure of frontier LLMs unexamined. We introduce Poker Arena, a no-limit Texas Hold'em tournament platform that couples a three-layer memory architecture (within-hand, session, and cross-session) with a nine-axis cognitive profile decomposing strategic reasoning into interpretable dimensions such as bet-sizing calibration and positional awareness. We evaluate seven frontier models across 50 sessions of 1,000 hands and in a controlled memory ablation. Tournament chips and aggregate axis score order the field differently: Claude Opus 4.6 nets +$15,730 in chips with 14 first-place finishes, yet ranks only fifth of seven on mean axis score. Persistent memory, meanwhile, helps some models and hurts others. These findings show that multi-axis evaluation surfaces capability structure that scalar leaderboards systematically misrank, with cross-dimensional consistency outweighing peak performance on any single axis.