Reinforcement Learning (RL) offers a powerful paradigm for autonomous robots to master generalist manipulation skills through trial and error. However, its real-world application is stifled by low sample efficiency. Recent Human-in-the-Loop (HIL) methods accelerate training by using human corrections, yet this approach faces a scalability barrier: reliance on human supervisors imposes a 1:1 supervision ratio, is prone to operator fatigue over extended sessions, and introduces high variance from inconsistent human proficiency. We present Agent-guided Policy Search (AGPS), a framework that automates the training pipeline by replacing human supervisors with a multimodal agent. Our key insight is that the agent can be viewed as a semantic world model, injecting intrinsic value priors to structure physical exploration. Through tool use, the agent provides precise guidance via corrective waypoints and spatial constraints that prune exploration. We validate our approach on three tasks, ranging from precision insertion to deformable object manipulation. Results demonstrate that AGPS outperforms HIL methods in sample efficiency. By automating the supervision pipeline, AGPS opens a path to labor-free, scalable robot learning. Project website: https://agps-rl.github.io/agps/.