Planar grasp detection is one of the most fundamental tasks in robotic manipulation, and recent progress in consumer-grade RGB-D sensors makes it possible to extract more comprehensive features from both the texture and shape modalities. However, depth maps are generally of lower quality and much noisier than RGB images, which makes it challenging to acquire the grasp depth and to fuse multi-modal cues. To address these two issues, this paper proposes a novel learning-based approach to RGB-D grasp detection, namely the Depth Guided Cross-modal Attention Network (DGCAN). To better leverage the geometry information recorded in the depth channel, a complete 6-dimensional rectangle representation is adopted, in which the grasp depth is explicitly considered in addition to the parameters of the common 5-dimensional representation. Predicting the extra grasp depth substantially strengthens feature learning, thereby leading to more accurate results. Moreover, to reduce the negative impact of the discrepancy in data quality between the two modalities, a Local Cross-modal Attention (LCA) module is designed, in which the depth features are refined according to cross-modal relations and concatenated with the RGB features for more sufficient fusion. Extensive simulation and physical evaluations are conducted, and the experimental results highlight the superiority of the proposed approach.
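To make the 6-dimensional rectangle representation concrete, the following is a minimal sketch rather than the paper's implementation. It assumes the common 5-dimensional rectangle is parameterised as center (x, y), size (w, h), and in-plane angle theta, and it adds the grasp depth d as a sixth field. The class name `Grasp6D`, the field names, and the units are hypothetical choices made only for illustration.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Grasp6D:
    """Illustrative 6-D grasp rectangle (hypothetical names and units)."""
    x: float      # rectangle center, image column (pixels, assumed)
    y: float      # rectangle center, image row (pixels, assumed)
    w: float      # rectangle width (pixels, assumed)
    h: float      # rectangle height (pixels, assumed)
    theta: float  # in-plane rotation about the center (radians, assumed)
    d: float      # grasp depth (metres along the camera axis, assumed)

    def as_5d(self):
        """Drop the depth to recover the common 5-D rectangle."""
        return (self.x, self.y, self.w, self.h, self.theta)

    def corners(self):
        """Return the four image-plane corners of the rotated rectangle."""
        c, s = math.cos(self.theta), math.sin(self.theta)
        offsets = [
            (-self.w / 2, -self.h / 2),
            (self.w / 2, -self.h / 2),
            (self.w / 2, self.h / 2),
            (-self.w / 2, self.h / 2),
        ]
        # Rotate each offset by theta, then translate to the center.
        return [
            (self.x + dx * c - dy * s, self.y + dx * s + dy * c)
            for dx, dy in offsets
        ]


if __name__ == "__main__":
    grasp = Grasp6D(x=100.0, y=50.0, w=40.0, h=20.0, theta=0.0, d=0.45)
    print(grasp.as_5d())
    print(grasp.corners())
```

The depth is kept as a separate field so that the planar rectangle can still be compared with 5-dimensional annotations through `as_5d()`, while `d` carries the extra quantity a network would have to predict.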