The PokeAgent Challenge: Competitive and Long-Context Learning at Scale

Seth Karten, Jake Grigsby, Tersoo Upaa, Junik Bae, Seonghun Hong, Hyunyoung Jeong, Jaeyoon Jung, Kun Kerdthaisong, Gyungbo Kim, Hyeokgi Kim, Yujin Kim, Eunju Kwon, Dongyu Liu, Patrick Mariglia, Sangyeon Park, Benedikt Schink, Xianwei Shi, Anthony Sistilli, Joseph Twin, Arian Urdu, Matin Urdu, Qiao Wang, Ling Wu, Wenli Zhang, Kunsheng Zhou, Stephanie Milani, Kiran Vodrahalli, Amy Zhang, Fei Fang, Yuke Zhu, Chi Jin

arXiv preprint. 41 pages, 26 figures, 5 tables. NeurIPS 2025 Competition Track.

We present the PokeAgent Challenge, a large-scale benchmark for decision-making research built on Pokemon's multi-agent battle system and expansive role-playing game (RPG) environment. Partial observability, game-theoretic reasoning, and long-horizon planning remain open problems for frontier AI, yet few benchmarks stress all three simultaneously under realistic conditions. PokeAgent targets these limitations at scale through two complementary tracks: our Battling Track, which calls for strategic reasoning and generalization under partial observability in competitive Pokemon battles, and our Speedrunning Track, which requires long-horizon planning and sequential decision-making in the Pokemon RPG. Our Battling Track supplies a dataset of 20M+ battle trajectories alongside a suite of heuristic, RL, and LLM-based baselines capable of high-level competitive play. Our Speedrunning Track provides the first standardized evaluation framework for RPG speedrunning, including an open-source multi-agent orchestration system for modular, reproducible comparisons of harness-based LLM approaches. Our NeurIPS 2025 competition validates both the quality of our resources and the research community's interest in Pokemon, with over 100 teams competing across both tracks and winning solutions detailed in our paper. Participant submissions and our baselines reveal considerable gaps between generalist (LLM), specialist (RL), and elite human performance. Analysis against the BenchPress evaluation matrix shows that Pokemon battling is nearly orthogonal to standard LLM benchmarks, measuring capabilities not captured by existing suites and positioning Pokemon as an unsolved benchmark that can drive RL and LLM research forward. We transition to a living benchmark with a live leaderboard for Battling and self-contained evaluation for Speedrunning at https://pokeagentchallenge.com.
