EmboCoach-Bench: Benchmarking AI Agents on Developing Embodied Robots

The field of Embodied AI is evolving rapidly toward general-purpose robotic systems, fueled by high-fidelity simulation and large-scale data collection. However, this scaling remains severely bottlenecked by a reliance on labor-intensive manual oversight, ranging from intricate reward shaping to hyperparameter tuning across heterogeneous backends. Inspired by the success of LLMs in software automation and scientific discovery, we introduce \textsc{EmboCoach-Bench}, a benchmark that evaluates the capacity of LLM agents to autonomously engineer embodied policies. The benchmark spans 32 expert-curated RL and IL tasks and treats executable code as the universal interface. We move beyond static generation to assess a dynamic closed-loop workflow in which agents use environment feedback to iteratively draft, debug, and optimize solutions, with improvements ranging from physics-informed reward design to policy architectures such as diffusion policies. Extensive evaluations yield three critical insights: (1) autonomous agents can surpass human-engineered baselines by 26.5\% in average success rate; (2) an agentic workflow with environment feedback effectively strengthens policy development and substantially narrows the performance gap between open-source and proprietary models; and (3) agents exhibit self-correction capabilities in pathological engineering cases, recovering task performance from near-total failure through iterative simulation-in-the-loop debugging. Ultimately, this work establishes a foundation for self-evolving embodied intelligence, accelerating the paradigm shift from labor-intensive manual tuning to scalable, autonomous engineering in embodied AI.
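
The closed-loop workflow described in the abstract (draft, debug, and optimize against environment feedback) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and is not the benchmark's actual interface: `Feedback`, `closed_loop`, `toy_propose`, and `toy_evaluate` are hypothetical, the "candidate" is reduced to a single reward weight, and the evaluator is a toy function rather than a simulator.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class Feedback:
    """Hypothetical environment feedback for one candidate."""
    success_rate: float      # fraction of successful episodes, in [0, 1]
    error: Optional[str]     # error message if the candidate failed to run


def closed_loop(
    propose: Callable[[Optional[float], Optional[Feedback]], float],
    evaluate: Callable[[float], Feedback],
    max_iters: int = 5,
) -> Tuple[Optional[float], float]:
    """Iteratively draft a candidate, evaluate it, and keep the best one."""
    best, best_score = None, -1.0
    last, feedback = None, None
    for _ in range(max_iters):
        last = propose(last, feedback)    # draft, or revise using feedback
        feedback = evaluate(last)         # run in the (toy) environment
        if feedback.error is None and feedback.success_rate > best_score:
            best, best_score = last, feedback.success_rate
    return best, best_score


def toy_evaluate(weight: float) -> Feedback:
    """Toy stand-in for a simulator: success peaks at weight == 0.5."""
    if weight < 0:
        return Feedback(0.0, "negative reward weight")
    return Feedback(max(0.0, 1.0 - abs(weight - 0.5)), None)


def toy_propose(last: Optional[float], fb: Optional[Feedback]) -> float:
    """Toy stand-in for an LLM agent."""
    if last is None:
        return -1.0          # first draft happens to be broken
    if fb is not None and fb.error is not None:
        return 0.0           # "debug": repair the failing candidate
    return last + 0.25       # "optimize": nudge the working candidate


best, score = closed_loop(toy_propose, toy_evaluate)
print(best, score)
```

In this toy run the first draft fails, the second iteration repairs it, and later iterations improve the score; a real agent would replace `toy_propose` with an LLM call and `toy_evaluate` with training and rollout in a simulator.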

