Behavior cloning (BC) is currently a dominant paradigm for learning real-world visual manipulation. However, in tasks that require locally corrective behaviors, such as multi-part assembly, learning robust policies purely from human demonstrations remains challenging. Reinforcement learning (RL) can mitigate these limitations by allowing policies to acquire locally corrective behaviors through task-reward supervision and exploration. This paper explores the use of RL fine-tuning to improve BC-trained policies on precise manipulation tasks. We analyze and overcome the technical challenges of using RL to directly train policy networks that incorporate modern architectural components such as diffusion models and action chunking. We propose training residual policies on top of frozen BC-trained diffusion models using standard policy-gradient methods and sparse rewards, an approach we call ResiP (Residual for Precise manipulation). Our experimental results demonstrate that this residual learning framework can significantly improve success rates over the base BC-trained models on high-precision assembly tasks by learning corrective actions. We also show that by combining ResiP with teacher-student distillation and visual domain randomization, our method enables learning real-world policies for robotic assembly directly from RGB images. Videos and code are available at \url{https://residual-assembly.github.io}.
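To make the residual-learning recipe concrete, the sketch below illustrates the core idea: a small Gaussian residual policy adds per-step corrections to the action chunk sampled from a frozen BC-trained base policy, and is updated with a policy gradient on a sparse task reward. All interfaces here are hypothetical stand-ins, not the authors' implementation: the frozen diffusion base is mocked by a linear layer, the dimensions are arbitrary, and a single REINFORCE-style update substitutes for the standard policy-gradient method (e.g., PPO) referenced above.

```python
import torch
import torch.nn as nn

OBS_DIM, ACT_DIM, CHUNK = 32, 7, 8  # hypothetical dimensions

class ResidualPolicy(nn.Module):
    """Gaussian MLP that predicts a small correction to each base action."""
    def __init__(self, scale: float = 0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(OBS_DIM + ACT_DIM, 256), nn.ReLU(),
            nn.Linear(256, ACT_DIM),
        )
        self.log_std = nn.Parameter(torch.full((ACT_DIM,), -2.0))
        self.scale = scale  # bounds corrections so they stay near the base action

    def dist(self, obs, base_action):
        mean = self.scale * torch.tanh(self.net(torch.cat([obs, base_action], -1)))
        return torch.distributions.Normal(mean, self.log_std.exp())

# Stand-in for the frozen, BC-trained diffusion policy that emits action chunks.
base_policy = nn.Linear(OBS_DIM, ACT_DIM * CHUNK).requires_grad_(False)

residual = ResidualPolicy()
optimizer = torch.optim.Adam(residual.parameters(), lr=3e-4)

obs = torch.randn(OBS_DIM)                      # placeholder observation
chunk = base_policy(obs).view(CHUNK, ACT_DIM)   # frozen base action chunk

log_probs = []
for base_action in chunk:                       # correct each step in the chunk
    d = residual.dist(obs, base_action)
    delta = d.sample()
    log_probs.append(d.log_prob(delta).sum())
    executed = base_action + delta              # action sent to the environment

sparse_reward = 1.0                             # e.g., 1 on task success, 0 otherwise
loss = -sparse_reward * torch.stack(log_probs).sum()  # REINFORCE-style update
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

The bounded correction scale and the frozen base keep exploration local, which is what allows sparse-reward RL to refine an already-competent BC policy rather than having to learn the task from scratch.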