OdysseyArena: Benchmarking Large Language Models For Long-Horizon, Active and Inductive Interactions

Fangzhi Xu, Hang Yan, Qiushi Sun, Jinyang Wu, Zixian Huang, Muye Huang, Jingyang Gong, Zichen Ding, Kanzhi Cheng, Yian Wang, Xinyu Che, Zeyi Sun, Jian Zhang, Zhangyue Yin, Haoran Luo, Xuanjing Huang, Ben Kao, Jun Liu, Qika Lin

From arXiv, 34 pages

The rapid advancement of Large Language Models (LLMs) has catalyzed the development of autonomous agents capable of navigating complex environments. However, existing evaluations primarily adopt a deductive paradigm, where agents execute tasks based on explicitly provided rules and static goals, often within limited planning horizons. Crucially, this neglects the inductive necessity for agents to discover latent transition laws from experience autonomously, which is the cornerstone for enabling agentic foresight and sustaining strategic coherence. To bridge this gap, we introduce OdysseyArena, which re-centers agent evaluation on long-horizon, active, and inductive interactions. We formalize and instantiate four primitives, translating abstract transition dynamics into concrete interactive environments. Building upon this, we establish OdysseyArena-Lite for standardized benchmarking, providing a set of 120 tasks to measure an agent's inductive efficiency and long-horizon discovery. Pushing further, we introduce OdysseyArena-Challenge to stress-test agent stability across extreme interaction horizons (e.g., > 200 steps). Extensive experiments on 15+ leading LLMs reveal that even frontier models exhibit a deficiency in inductive scenarios, identifying a critical bottleneck in the pursuit of autonomous discovery in complex environments. Our code and data are available at https://github.com/xufangzhi/Odyssey-Arena
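To make the deductive/inductive distinction in the abstract concrete, here is a minimal toy sketch. It is not one of the four OdysseyArena primitives and is not taken from the authors' repository; `HiddenShiftEnv`, `induce_offsets`, `plan`, and `solve` are hypothetical names introduced only for illustration. The environment hides its transition law (each action adds an unknown offset modulo the state-space size), so the agent must first probe actions to infer that law and only then plan toward a goal.

```python
from collections import deque


class HiddenShiftEnv:
    """Toy environment (hypothetical): state is an integer mod `size`.

    Each action adds a hidden offset. The offsets play the role of a
    latent transition law that the agent is never told.
    """

    def __init__(self, size, offsets, start=0):
        self.size = size
        self._offsets = offsets  # hidden from the agent
        self.state = start
        self.steps = 0

    def step(self, action):
        self.state = (self.state + self._offsets[action]) % self.size
        self.steps += 1
        return self.state


def induce_offsets(env, n_actions):
    """Active exploration: try each action once and record the observed shift."""
    learned = {}
    for action in range(n_actions):
        before = env.state
        after = env.step(action)
        learned[action] = (after - before) % env.size
    return learned


def plan(start, goal, learned, size):
    """Breadth-first search over the induced model; shortest action list or None."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        state, path = queue.popleft()
        if state == goal:
            return path
        for action, offset in learned.items():
            nxt = (state + offset) % size
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [action]))
    return None


def solve(env, n_actions, goal):
    """Induce the transition rule from experience, then act on the plan."""
    learned = induce_offsets(env, n_actions)
    actions = plan(env.state, goal, learned, env.size)
    if actions is not None:
        for action in actions:
            env.step(action)
    return learned, actions


if __name__ == "__main__":
    env = HiddenShiftEnv(size=12, offsets={0: 5, 1: 1})
    learned, actions = solve(env, n_actions=2, goal=4)
    print(learned, actions, env.state, env.steps)
```

In a deductive setup the offsets would be handed to the agent up front and only `plan` would be needed; here the exploration steps spent in `induce_offsets` count against the interaction budget, which is the kind of cost an inductive-efficiency measure would presumably capture.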
