While imitation learning (IL) offers a promising framework for teaching robots various behaviors, learning complex tasks remains challenging. Existing IL policies struggle to generalize effectively across visual and spatial variations even for simple tasks. In this work, we introduce SPHINX: Salient Point-based Hybrid ImitatioN and eXecution, a flexible IL policy that leverages multimodal observations (point clouds and wrist images), along with a hybrid action space of low-frequency, sparse waypoints and high-frequency, dense end-effector movements. Given 3D point cloud observations, SPHINX learns to infer task-relevant points within a point cloud, or salient points, which support spatial generalization by focusing on semantically meaningful features. These salient points serve as anchor points to predict waypoints for long-range movement, such as reaching target poses in free space. Once near a salient point, SPHINX learns to switch to predicting dense end-effector movements given close-up wrist images for precise phases of a task. By exploiting the strengths of different input modalities and action representations for different manipulation phases, SPHINX tackles complex tasks in a sample-efficient, generalizable manner. Our method achieves 86.7% success across 4 real-world and 2 simulated tasks, outperforming the next best state-of-the-art IL baseline by 41.1% on average across 440 real-world trials. SPHINX additionally generalizes to novel viewpoints, visual distractors, spatial arrangements, and execution speeds, with a 1.7x speedup over the most competitive baseline. Our website (http://sphinx-manip.github.io) provides open-sourced code for data collection, training, and evaluation, along with supplementary videos.