Learning from demonstrations struggles to generalize beyond the training data and is fragile even to slight visual variations. To tackle this problem, we introduce Lan-o3dp, a language-guided, object-centric diffusion policy that takes 3D representations of task-relevant objects as conditional input and can be guided by a cost function at inference time to satisfy safety constraints. Lan-o3dp generalizes strongly across variations such as background changes and visual ambiguity, and can avoid novel obstacles unseen during the demonstrations. Specifically, we first train a diffusion policy conditioned on the point clouds of target objects, and then harness a large language model to decompose the user instruction into task-related units consisting of target objects and obstacles; these units serve either as visual observations for the policy network or are converted into a cost function that steers trajectory generation toward collision-free regions at test time. In simulation experiments, our method trains efficiently and achieves higher success rates than the baselines. In real-world experiments, it generalizes strongly to unseen instances, cluttered scenes, and scenes with multiple similar objects, and demonstrates training-free obstacle avoidance.
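The test-time cost guidance described above can be sketched as classifier-guidance-style sampling: after each reverse-diffusion step, the trajectory is nudged down the gradient of an obstacle-proximity cost. This is a minimal illustrative sketch, not the paper's implementation; the cost definition, the guidance scale, and all function names below are assumptions, and the trained denoising network is stubbed out as a callback.

```python
import numpy as np

def obstacle_cost_grad(traj, obstacles, radius=0.2):
    """Gradient of a hinge-style proximity cost (illustrative, not the paper's cost).

    Cost per waypoint: sum over obstacle points of max(0, radius - distance),
    so the gradient points toward each nearby obstacle; descending it pushes
    waypoints away.
    traj:      (T, 3) waypoint positions
    obstacles: (M, 3) obstacle point cloud (e.g. segmented by the LLM pipeline)
    """
    grad = np.zeros_like(traj)
    for i, p in enumerate(traj):
        diff = p - obstacles                                # (M, 3)
        dist = np.linalg.norm(diff, axis=1, keepdims=True)  # (M, 1)
        close = dist < radius                               # only nearby points matter
        # d(radius - dist)/dp = -diff / dist  -> accumulate over close points
        grad[i] = -np.sum(close * diff / np.maximum(dist, 1e-8), axis=0)
    return grad

def guided_denoise(traj, obstacles, denoise_step, n_steps=50, guide_scale=0.05):
    """Interleave reverse-diffusion updates with cost-gradient descent steps.

    denoise_step(traj, t) stands in for one step of the trained diffusion
    policy's denoiser (a hypothetical placeholder here).
    """
    for t in range(n_steps):
        traj = denoise_step(traj, t)                        # standard denoising update
        traj = traj - guide_scale * obstacle_cost_grad(traj, obstacles)
    return traj
```

Because the guidance term is added only at sampling time, the policy needs no retraining to avoid obstacles it never saw in the demonstrations, which is the training-free behavior the abstract claims.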