One-shot transfer of dexterous grasps to novel scenes with object and context variations has been a challenging problem. While feature fields distilled from large vision models have enabled semantic correspondences across 3D scenes, their features are point-based and restricted to object surfaces, which limits their ability to model the complex semantic feature distributions involved in hand-object interactions. In this work, we propose the \textit{neural attention field}, which represents a semantics-aware dense feature field in 3D space by modeling inter-point relevance instead of individual point features. At its core is a transformer decoder that computes cross-attention between any 3D query point and all scene points, producing the query point's feature through attention-based aggregation. We further propose a self-supervised framework for training the transformer decoder from only a few 3D point clouds, without hand demonstrations. After training, the attention field can be applied to novel scenes for semantics-aware dexterous grasping from a one-shot demonstration. Experiments show that our method yields better optimization landscapes by encouraging the end-effector to focus on task-relevant scene regions, resulting in significantly higher success rates on real robots compared with feature-field-based methods.
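The attention-based feature aggregation described in the abstract can be sketched as a single-head cross-attention step, where a 3D query point attends to all scene points and receives their attention-weighted features. The shapes, random projections, and single-head formulation below are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16        # feature dimension (assumed)
n_scene = 128 # number of scene points (assumed)

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

# Learned projections (random here; trained in the real model).
W_q = rng.standard_normal((3, d))
W_k = rng.standard_normal((d, d))
W_v = rng.standard_normal((d, d))

scene_feats = rng.standard_normal((n_scene, d))  # per-point features from a vision backbone
query_point = rng.standard_normal(3)             # any 3D location, not only object surfaces

# Cross-attention: the query attends to all scene points, and its feature
# is the attention-weighted sum of their value vectors.
Q = query_point @ W_q                 # (d,)
K = scene_feats @ W_k                 # (n_scene, d)
V = scene_feats @ W_v                 # (n_scene, d)
attn = softmax(Q @ K.T / np.sqrt(d))  # (n_scene,) relevance of each scene point
query_feat = attn @ V                 # (d,) aggregated feature at the query point
```

Because the query can be any point in space, this defines a dense feature field rather than features pinned to surface points, which is the key distinction the abstract draws against distilled feature fields.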