Humans demonstrate remarkable skill in transferring manipulation abilities across objects of varying shapes, poses, and appearances, a capability rooted in their understanding of semantic correspondences between different instances. To equip robots with a similar high-level comprehension, we present SparseDFF, a novel Distilled Feature Field (DFF) for 3D scenes that uses large 2D vision models to extract semantic features from sparse RGBD images, a setting that remains underexplored despite its relevance to many tasks with fixed-camera setups. By mapping image features to a 3D point cloud, SparseDFF generates view-consistent 3D DFFs, enabling efficient one-shot learning of dexterous manipulations. Central to SparseDFF is a feature refinement network, optimized with a contrastive loss between views and a point-pruning mechanism for feature continuity. This makes it possible to minimize feature discrepancies with respect to end-effector parameters, bridging demonstrations and target manipulations. Validated in real-world scenarios with a dexterous hand, SparseDFF proves effective in manipulating both rigid and deformable objects, and generalizes across object and scene variations.
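The step of mapping image features to a 3D point cloud can be illustrated with a minimal sketch. It assumes a pinhole camera with known intrinsics `K`, a metric depth map, and a per-pixel feature map already resized to the depth resolution. The function name `backproject_features` and its parameters are hypothetical and are not taken from the SparseDFF implementation; the refinement network, contrastive loss, and point pruning are not shown.

```python
import numpy as np


def backproject_features(depth, feats, K, depth_min=1e-6):
    """Lift per-pixel features to a 3D point cloud in the camera frame.

    Hypothetical helper for illustration only.
    depth: (H, W) metric depth map
    feats: (H, W, C) per-pixel features from some 2D vision model
    K:     (3, 3) pinhole intrinsics
    Returns points (N, 3) and features (N, C) for pixels with valid depth.
    """
    H, W = depth.shape
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]

    # u indexes columns (x direction), v indexes rows (y direction).
    u, v = np.meshgrid(np.arange(W), np.arange(H))

    # Discard pixels without a usable depth reading.
    valid = depth > depth_min
    z = depth[valid]

    # Standard pinhole back-projection.
    x = (u[valid] - cx) * z / fx
    y = (v[valid] - cy) * z / fy

    points = np.stack([x, y, z], axis=1)
    return points, feats[valid]
```

With several fixed cameras, each view's points would additionally be transformed by that camera's extrinsics into a shared world frame before the per-view clouds are merged; that step is omitted here.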