Reinforcement Learning (RL) offers a powerful paradigm for autonomous robots to master generalist manipulation skills through trial and error. However, its real-world application is stifled by severe sample inefficiency. Recent Human-in-the-Loop (HIL) methods accelerate training by using human corrections, yet this approach faces a scalability barrier: a 1:1 supervision ratio limits fleet expansion, operator fatigue degrades extended sessions, and inconsistent human proficiency introduces high variance. We present Agent-guided Policy Search (AGPS), a framework that automates the training pipeline by replacing human supervisors with a multimodal agent. Our key insight is that the agent can be viewed as a semantic world model, injecting intrinsic value priors to structure physical exploration. Through executable tools, the agent provides precise guidance via corrective waypoints and spatial constraints for exploration pruning. We validate our approach on two tasks spanning precision insertion and deformable object manipulation. Results demonstrate that AGPS outperforms HIL methods in sample efficiency, automating the supervision pipeline and unlocking a path to labor-free, scalable robot learning. Project website: https://agps-rl.github.io/agps.
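The abstract's exploration-pruning idea can be illustrated with a minimal sketch. Everything below is an assumption for illustration, not the paper's API: the agent is assumed to emit a corrective waypoint and an axis-aligned spatial constraint box; candidate exploration targets outside the box are pruned, and survivors are ranked by proximity to the waypoint.

```python
import math
import random

random.seed(0)

def prune_and_rank(candidates, waypoint, box_lo, box_hi):
    """Hypothetical helper: drop candidate end-effector targets that
    violate the agent's spatial constraint (an axis-aligned box), then
    sort the rest by distance to the agent's corrective waypoint."""
    inside = [c for c in candidates
              if all(lo <= x <= hi for x, lo, hi in zip(c, box_lo, box_hi))]
    return sorted(inside, key=lambda c: math.dist(c, waypoint))

# Illustrative 3-D targets sampled by the RL explorer.
candidates = [tuple(random.uniform(-1, 1) for _ in range(3)) for _ in range(64)]
waypoint = (0.2, 0.0, 0.1)                      # assumed agent-suggested correction
box_lo, box_hi = (-0.5,) * 3, (0.5,) * 3        # assumed spatial constraint

ranked = prune_and_rank(candidates, waypoint, box_lo, box_hi)
print(len(ranked), "of", len(candidates), "candidates survive pruning")
```

In this reading, the constraint box plays the role of exploration pruning and the waypoint supplies the value prior that orders what remains; the actual AGPS tool interface may differ.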