Long-horizon planning is widely recognized as a core capability of autonomous LLM-based agents; however, current evaluation frameworks are largely episodic, domain-specific, or insufficiently grounded in persistent economic dynamics. We introduce EcoGym, a generalizable benchmark for continuous plan-and-execute decision making in interactive economies. EcoGym comprises three diverse environments (Vending, Freelance, and Operation), implemented as a unified decision-making process with standardized interfaces and budgeted actions over an effectively unbounded horizon (1000+ steps when evaluated over 365 day-loops). Evaluation in EcoGym is based on business-relevant outcomes (e.g., net worth, income, and daily active users, DAU), targeting long-term strategic coherence and robustness under partial observability and stochasticity. Experiments across eleven leading LLMs expose a systematic tension: no single model dominates across all three scenarios. Critically, we find that models exhibit significant suboptimality in either high-level strategy or efficient action execution. EcoGym is released as an open, extensible testbed for transparent long-horizon agent evaluation and for studying controllability-utility trade-offs in realistic economic settings.