Embodied AI is evolving rapidly toward general-purpose robotic systems, fueled by high-fidelity simulation and large-scale data collection. However, this scaling remains severely bottlenecked by labor-intensive manual oversight, from intricate reward shaping to hyperparameter tuning across heterogeneous backends. Inspired by the success of LLMs in software automation and scientific discovery, we introduce \textsc{EmboCoach-Bench}, a benchmark that evaluates the capacity of LLM agents to autonomously engineer embodied policies. The benchmark spans 32 expert-curated RL and IL tasks and treats executable code as the universal interface. We move beyond static generation to assess a dynamic closed-loop workflow in which agents use environment feedback to iteratively draft, debug, and optimize solutions, with improvements ranging from physics-informed reward design to policy architectures such as diffusion policies. Extensive evaluations yield three key insights: (1) autonomous agents can surpass human-engineered baselines by 26.5\% in average success rate; (2) an agentic workflow with environment feedback strengthens policy development and substantially narrows the performance gap between open-source and proprietary models; and (3) agents exhibit self-correction capabilities in pathological engineering cases, recovering task performance from near-total failure through iterative simulation-in-the-loop debugging. Ultimately, this work establishes a foundation for self-evolving embodied intelligence, accelerating the shift from labor-intensive manual tuning to scalable, autonomous engineering in embodied AI.