Can LLMs Do Rocket Science? Exploring the Limits of Complex Reasoning with GTOC 12

Large Language Models (LLMs) have demonstrated remarkable proficiency in code generation and general reasoning, yet their capacity for autonomous multi-stage planning in high-dimensional, physically constrained environments remains an open research question. This study investigates the limits of current AI agents by evaluating them against the 12th Global Trajectory Optimization Competition (GTOC 12), a complex astrodynamics challenge requiring the design of a large-scale asteroid mining campaign. We adapt the MLE-Bench framework to the domain of orbital mechanics and deploy an AIDE-based agent architecture to autonomously generate and refine mission solutions. To assess performance beyond binary validity, we employ an "LLM-as-a-Judge" methodology, using a rubric developed by domain experts to evaluate strategic viability across five structural categories. A comparative analysis of models, ranging from GPT-4-Turbo to reasoning-enhanced architectures such as Gemini 2.5 Pro and o3, reveals a significant trend: the average strategic viability score has nearly doubled in the last two years, rising from 9.3 to 17.2 out of 26. However, we identify a critical capability gap between strategy and execution. Advanced models demonstrate sophisticated conceptual understanding, correctly framing objective functions and mission architectures, but they consistently fail at implementation due to physical unit inconsistencies, boundary condition errors, and inefficient debugging loops. We conclude that, while current LLMs often demonstrate sufficient knowledge and intelligence to tackle space science tasks, they remain limited by an implementation barrier, functioning as powerful domain facilitators rather than fully autonomous engineers.
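
To make the "physical unit inconsistency" failure mode concrete, the following is a minimal illustrative sketch in Python. It is not taken from the agents' generated code or from the GTOC 12 problem files. The function name `orbital_period_days` and the example semi-major axis are hypothetical, and the constants are standard reference values. It shows how passing a distance in astronomical units to a routine that expects kilometres silently yields a physically meaningless orbital period instead of raising an error.

```python
import math

# Standard reference constants (assumed here for illustration only).
MU_SUN_KM3_S2 = 1.32712440018e11  # Sun's gravitational parameter, km^3/s^2
AU_KM = 1.495978707e8             # one astronomical unit, km
SECONDS_PER_DAY = 86400.0


def orbital_period_days(semi_major_axis_km: float) -> float:
    """Kepler's third law, T = 2*pi*sqrt(a^3 / mu), with a given in km."""
    period_s = 2.0 * math.pi * math.sqrt(semi_major_axis_km**3 / MU_SUN_KM3_S2)
    return period_s / SECONDS_PER_DAY


# Hypothetical main-belt asteroid semi-major axis, expressed in AU.
a_au = 2.8

# Consistent units: convert AU to km before calling the routine.
correct_days = orbital_period_days(a_au * AU_KM)

# Unit bug: the AU value is passed where km is expected. No exception is
# raised; the result is simply many orders of magnitude too small.
wrong_days = orbital_period_days(a_au)

print(f"consistent units:   {correct_days:.1f} days")
print(f"inconsistent units: {wrong_days:.3e} days")
```

Because the buggy call still returns a finite number, a mistake of this kind can propagate through later stages of a trajectory pipeline and only surface when the final solution fails validation, which is one plausible reason such errors are costly to debug.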
