In many applications of advanced robotic manipulation, six degrees of freedom (6DoF) object pose estimates are continuously required. In this work, we develop a multi-modality tracker that fuses information from visual appearance and geometry to estimate object poses. The algorithm extends our previous method ICG, which uses geometry, to additionally consider surface appearance. In general, object surfaces contain local characteristics from text, graphics, and patterns, as well as global differences from distinct materials and colors. To incorporate this visual information, two modalities are developed. For local characteristics, keypoint features are used to minimize distances between points from keyframes and the current image. For global differences, a novel region approach is developed that considers multiple regions on the object surface. In addition, it allows the modeling of external geometries. Experiments on the YCB-Video and OPT datasets demonstrate that our approach ICG+ performs best on both datasets, outperforming both conventional and deep learning-based methods. At the same time, the algorithm is highly efficient and runs at more than 300 Hz. The source code of our tracker is publicly available.
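To make the keypoint idea concrete, the following is a minimal sketch of a generic reprojection cost: 3D points associated with a keyframe are transformed by a candidate pose, projected with a pinhole camera, and compared with matched 2D keypoints in the current image. It is not the formulation used in ICG+. The function names, the pinhole model, the plain sum of squared pixel distances, and all numeric values are assumptions chosen for illustration only.

```python
def project(point, fx, fy, cx, cy):
    """Pinhole projection of a 3D camera-frame point to pixel coordinates."""
    x, y, z = point
    return (fx * x / z + cx, fy * y / z + cy)


def transform(rotation, translation, point):
    """Apply a rigid transform: 3x3 rotation (nested lists) plus 3-vector translation."""
    return tuple(
        sum(rotation[i][j] * point[j] for j in range(3)) + translation[i]
        for i in range(3)
    )


def keypoint_cost(rotation, translation, model_points, image_points, intrinsics):
    """Sum of squared pixel distances between projected model points and
    their matched keypoints in the current image (hypothetical cost)."""
    fx, fy, cx, cy = intrinsics
    cost = 0.0
    for p_model, (u_obs, v_obs) in zip(model_points, image_points):
        u, v = project(transform(rotation, translation, p_model), fx, fy, cx, cy)
        cost += (u - u_obs) ** 2 + (v - v_obs) ** 2
    return cost


if __name__ == "__main__":
    identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    intrinsics = (500.0, 500.0, 320.0, 240.0)  # assumed fx, fy, cx, cy
    true_translation = (0.0, 0.0, 1.0)         # object 1 m in front of the camera
    model_points = [(0.1, 0.0, 0.0), (0.0, 0.1, 0.0), (-0.1, -0.1, 0.05)]

    # Synthetic "observed" keypoints generated from the true pose.
    observed = [
        project(transform(identity, true_translation, p), *intrinsics)
        for p in model_points
    ]

    print(keypoint_cost(identity, true_translation, model_points, observed, intrinsics))
    print(keypoint_cost(identity, (0.01, 0.0, 1.0), model_points, observed, intrinsics))
```

A tracker would minimize a cost of this kind over the pose parameters, typically with a robust loss and in combination with the other modalities. The sketch only evaluates the cost for two candidate poses: the pose that generated the observations and one shifted by 1 cm.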