Recent advances in agentic Large Language Models (LLMs) have positioned them as generalist planners capable of reasoning and acting across diverse tasks. However, existing agent benchmarks largely focus on symbolic or weakly grounded environments, leaving performance in physics-constrained real-world domains underexplored. We introduce AstroReason-Bench, a comprehensive benchmark for evaluating agentic planning on Space Planning Problems (SPPs), a family of high-stakes problems characterized by heterogeneous objectives, strict physical constraints, and long-horizon decision-making. AstroReason-Bench integrates multiple scheduling regimes, including ground station communication and agile Earth observation, and provides a unified agent-oriented interaction protocol. Evaluating a range of state-of-the-art open- and closed-source agentic LLM systems, we find that current agents substantially underperform specialized solvers, highlighting key limitations of generalist planning under realistic constraints. AstroReason-Bench offers a challenging and diagnostic testbed for future agentic research.