Advances in behavior cloning (BC), such as action chunking and diffusion, have enabled impressive capabilities. Still, imitation alone remains insufficient for learning reliable policies for tasks that require precise alignment and insertion of objects, such as assembly. Our key insight is that chunked BC policies effectively function as trajectory planners, which enables long-horizon tasks. However, because they execute action chunks open-loop, they lack the fine-grained reactivity needed for reliable execution. Further, we find that the performance of BC policies saturates even as the amount of data increases. Reinforcement learning (RL) is a natural way to overcome BC's limitations, but it is not straightforward to apply directly to action-chunked models like diffusion policies. We present a simple yet effective method, ResiP (Residual for Precise Manipulation), that sidesteps these challenges by augmenting a frozen, chunked BC model with a fully closed-loop residual policy trained with RL. The residual policy is trained via on-policy RL, which addresses distribution shift and introduces reactive control without altering the BC trajectory planner. Evaluation on high-precision manipulation tasks demonstrates that ResiP outperforms both BC methods and direct RL fine-tuning. Videos, code, and data are available at https://residual-assembly.github.io.
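To make the control structure concrete, the following is a minimal sketch of how a frozen chunked planner and a per-step residual might be combined at execution time. It is an illustration under stated assumptions, not the released implementation: `base_policy`, `residual_policy`, `rollout`, the constants, the toy dynamics, and the choice to feed the base action into the residual are all hypothetical, and the on-policy RL training of the residual weights is not shown.

```python
import numpy as np

CHUNK_LEN = 8         # hypothetical: number of actions per predicted chunk
ACTION_DIM = 2        # hypothetical: toy 2-D action space
RESIDUAL_SCALE = 0.1  # hypothetical: bound on the size of each correction


def base_policy(obs, chunk_len=CHUNK_LEN):
    """Stand-in for the frozen chunked BC model.

    Plans a straight-line chunk of actions that moves the state to the origin.
    """
    return np.tile(-obs / chunk_len, (chunk_len, 1))


def residual_policy(obs, base_action, weights):
    """Stand-in for the residual policy: a bounded linear map.

    Assumption: the residual sees the current observation and the base action.
    """
    features = np.concatenate([obs, base_action])
    return RESIDUAL_SCALE * np.tanh(weights @ features)


def rollout(obs, weights, num_steps=16):
    """Run the combined controller on toy dynamics (state integrates actions)."""
    obs = np.asarray(obs, dtype=float)
    actions = []
    chunk = None
    for t in range(num_steps):
        idx = t % CHUNK_LEN
        if idx == 0:
            # The base model replans only at chunk boundaries, so the chunk
            # itself is executed open-loop.
            chunk = base_policy(obs)
        base_action = chunk[idx]
        # The residual is recomputed at every step from the current
        # observation, which is what makes the combined controller closed-loop.
        correction = residual_policy(obs, base_action, weights)
        action = base_action + correction
        obs = obs + action  # toy dynamics, purely illustrative
        actions.append(action)
    return obs, np.array(actions)
```

With all-zero residual weights the combined controller reduces to the base planner, which is one reason a residual formulation can be a convenient starting point for RL: only `weights` would be updated, while `base_policy` stays fixed.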