One critical challenge in 6D object pose estimation from a single RGBD image is the efficient integration of two different modalities, i.e., color and depth. In this work, we tackle this problem with a novel Deep Fusion Transformer~(DFTr) block that aggregates cross-modality features to improve pose estimation. Unlike existing fusion methods, the proposed DFTr models cross-modality semantic correlation by leveraging the semantic similarity between the color and depth features, so that globally enhanced features from the two modalities can be better integrated for improved information extraction. Moreover, to further improve robustness and efficiency, we introduce a novel weighted vector-wise voting algorithm that employs a non-iterative global optimization strategy for precise 3D keypoint localization while achieving near real-time inference. Extensive experiments show the effectiveness and strong generalization capability of our proposed 3D keypoint voting algorithm. Results on four widely used benchmarks also demonstrate that our method outperforms state-of-the-art methods by large margins.
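To make the idea of non-iterative, weighted vector-wise voting more concrete, the following is a minimal sketch and not the algorithm proposed in this work. It assumes that each vote consists of a 3D seed point, a predicted direction toward the keypoint, and a scalar confidence, and it substitutes a standard weighted least-squares line intersection, which has a closed-form solution. The function name `vote_keypoint` and its parameters are hypothetical.

```python
import numpy as np


def vote_keypoint(points, directions, weights):
    """Weighted least-squares intersection of 3D lines (hypothetical helper).

    points:     (N, 3) seed points, one per vote
    directions: (N, 3) predicted direction from each point toward the keypoint
    weights:    (N,)   non-negative confidence per vote
    """
    points = np.asarray(points, dtype=float)
    directions = np.asarray(directions, dtype=float)
    weights = np.asarray(weights, dtype=float)

    # Normalize directions so each matrix below is a true orthogonal projector.
    d = directions / np.linalg.norm(directions, axis=1, keepdims=True)

    # P_i = I - d_i d_i^T removes the component along line i, so
    # ||P_i (k - p_i)|| is the distance from k to that line.
    proj = np.eye(3)[None, :, :] - d[:, :, None] * d[:, None, :]

    # Normal equations of the weighted problem:
    #   (sum_i w_i P_i) k = sum_i w_i P_i p_i
    A = np.einsum("n,nij->ij", weights, proj)
    b = np.einsum("n,nij,nj->i", weights, proj, points)

    # One 3x3 solve, no iterations; lstsq tolerates degenerate vote sets.
    k, *_ = np.linalg.lstsq(A, b, rcond=None)
    return k


if __name__ == "__main__":
    # Synthetic check: noise-free votes that all point at the same target.
    rng = np.random.default_rng(0)
    target = np.array([0.1, -0.2, 0.5])
    pts = rng.normal(size=(50, 3))
    print(vote_keypoint(pts, target - pts, np.ones(50)))
```

With noise-free votes the estimate should match the target up to floating-point error. In this sketch, giving an unreliable vote a small weight reduces its influence on the solution, which is one plausible reading of how weighting can improve robustness without resorting to iterative schemes.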