Diffusion policies have demonstrated robust performance in generative modeling, prompting their application to robotic manipulation controlled via language descriptions. In this paper, we introduce a zero-shot, open-vocabulary diffusion policy method for robot manipulation. Using Vision-Language Models (VLMs), our method transforms linguistic task descriptions into actionable keyframes in 3D space. These keyframes guide the diffusion process via inpainting. However, naively forcing the diffusion process to adhere to the generated keyframes is problematic: keyframes produced by the VLM may be incorrect and lead to out-of-distribution (OOD) action sequences on which the diffusion model performs poorly. To address this, we develop an inpainting optimization strategy that balances adherence to the keyframes against fidelity to the training data distribution. Experimental evaluations demonstrate that our approach surpasses traditional fine-tuned language-conditioned methods in both simulated and real-world settings.
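The inpainting mechanism can be pictured as follows: at each denoising step, the predicted action trajectory is blended toward the VLM-generated keyframes at the timesteps they constrain, with a weight that controls how strongly the sample must follow them rather than the learned action distribution. The sketch below is a simplified, illustrative rendering of this idea, not the paper's implementation; the `denoiser` interface, the soft blending weight `guidance`, and the abbreviated re-noising step are all assumptions.

```python
# Minimal sketch (not the paper's implementation) of keyframe-guided inpainting
# during diffusion-policy sampling. Names such as `denoiser`, `keyframes`, and
# `keyframe_mask` are illustrative assumptions: the trajectory is a (T, D)
# tensor of end-effector poses, and `guidance` trades keyframe adherence
# against staying close to the learned action distribution.
import torch

@torch.no_grad()
def sample_with_keyframe_inpainting(denoiser, keyframes, keyframe_mask,
                                    horizon=16, action_dim=7, num_steps=50,
                                    guidance=0.5):
    """Denoise an action trajectory while softly pinning masked timesteps
    to VLM-generated keyframes.

    denoiser(x_t, t) -> predicted clean trajectory x0_hat   (assumed interface)
    keyframes:     (horizon, action_dim) tensor, valid only where mask == 1
    keyframe_mask: (horizon, 1) tensor in {0, 1}
    guidance:      0 ignores keyframes; 1 overwrites them (hard inpainting)
    """
    x = torch.randn(horizon, action_dim)             # start from pure noise
    for step in reversed(range(num_steps)):
        t = torch.full((1,), step)
        x0_hat = denoiser(x, t)                      # model's clean estimate

        # Soft inpainting: blend the prediction toward the keyframes only at
        # keyframe timesteps; guidance < 1 keeps incorrect (OOD) keyframes
        # from fully dominating the learned distribution.
        x0_hat = x0_hat + guidance * keyframe_mask * (keyframes - x0_hat)

        # Re-noise to the previous timestep (simplified; a real sampler would
        # use its alpha/sigma noise schedule here).
        noise_scale = step / num_steps
        x = x0_hat + noise_scale * torch.randn_like(x0_hat)
    return x
```

In this reading, the paper's inpainting optimization strategy corresponds to choosing how strongly (and at which timesteps) the keyframe constraint is applied, so that unreliable keyframes do not push the sample off the training data manifold.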