Learning robust and generalizable manipulation skills from demonstrations remains a key challenge in robotics, with broad applications in industrial automation and service robotics. Although recent imitation learning methods have achieved impressive results, they often require large amounts of demonstration data and struggle to generalize across spatial variants. In this work, we present a framework that learns manipulation skills from as few as 10 demonstrations yet generalizes to spatial variants such as different initial object positions and camera viewpoints. The framework consists of two key modules: Semantic Guided Perception (SGP), which constructs task-focused, spatially aware 3D point cloud representations from RGB-D inputs, and Spatial Generalized Decision (SGD), an efficient diffusion-based decision-making module that generates actions via denoising. To learn this generalization ability from limited data, we introduce a spatially equivariant training strategy that captures the spatial knowledge embedded in expert demonstrations. We validate the framework through extensive experiments on both simulation benchmarks and real-world robotic systems. Our method achieves a 60% improvement in success rate over state-of-the-art approaches on a series of challenging tasks, even under substantial variations in object poses and camera viewpoints. This work shows significant potential for advancing efficient, generalizable manipulation skill learning in real-world applications.
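To make the idea of spatially equivariant training concrete, the following is a minimal sketch of one common way such a strategy can be realized: applying the same random rigid transform to the observed point cloud and to the demonstrated action positions, so that a policy trained on the augmented pairs sees consistent geometry under spatial shifts. The abstract does not specify the actual procedure, so this augmentation scheme, the function names (`random_rigid_transform`, `apply_transform`, `augment_pair`), and the sampling ranges are all assumptions made for illustration.

```python
import numpy as np


def random_rigid_transform(rng, max_angle_rad=np.pi / 6, max_shift=0.1):
    """Sample a rotation about the z-axis plus a translation.

    The ranges are hypothetical defaults, not values from the paper.
    """
    theta = rng.uniform(-max_angle_rad, max_angle_rad)
    c, s = np.cos(theta), np.sin(theta)
    rotation = np.array([[c, -s, 0.0],
                         [s, c, 0.0],
                         [0.0, 0.0, 1.0]])
    translation = rng.uniform(-max_shift, max_shift, size=3)
    return rotation, translation


def apply_transform(points, rotation, translation):
    """Apply x -> R x + t to an (N, 3) array of 3D positions."""
    return points @ rotation.T + translation


def augment_pair(point_cloud, action_positions, rng):
    """Transform observation and action targets with the SAME rigid motion.

    Assumption: actions are end-effector positions expressed in the same
    frame as the point cloud, so both transform identically.
    """
    rotation, translation = random_rigid_transform(rng)
    new_cloud = apply_transform(point_cloud, rotation, translation)
    new_actions = apply_transform(action_positions, rotation, translation)
    return new_cloud, new_actions
```

Because a rigid transform preserves distances, the geometric relationship between every observed point and every action target is unchanged by the augmentation; only the absolute placement in the workspace varies. Rotations are restricted to the z-axis here as a simplifying choice for tabletop scenes, and orientation components of an action would need the rotation applied to them as well.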