Object pose estimation is a fundamental computer vision task used in many robotics and augmented reality applications. Many established approaches rely on predicting 2D-3D keypoint correspondences using RANSAC (Random Sample Consensus) and estimating the object pose using the PnP (Perspective-n-Point) algorithm. Because RANSAC is non-differentiable, the correspondences cannot be learned directly in an end-to-end fashion. In this paper, we address the stereo image-based object pose estimation problem by (i) introducing a differentiable RANSAC layer into a well-known monocular pose estimation network and (ii) exploiting an uncertainty-driven multi-view PnP solver that can fuse information from multiple views. We evaluate our approach on a challenging public stereo object pose estimation dataset, where it achieves state-of-the-art results compared with other recent approaches. Furthermore, our ablation study shows that the differentiable RANSAC layer contributes significantly to the accuracy of the proposed method. We release an open-source implementation of our method with this paper.
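To illustrate why hard RANSAC selection blocks end-to-end learning, the following is a minimal sketch, not the implementation released with the paper. It assumes one common relaxation (softmax-weighted expected loss over hypothesis scores, in the spirit of DSAC-style probabilistic selection); the abstract does not state which variant the method uses. The scores, losses, and function names are hypothetical toy values chosen for illustration.

```python
import numpy as np

def softmax(scores, temperature=1.0):
    # Numerically stable softmax over hypothesis scores.
    z = (scores - scores.max()) / temperature
    e = np.exp(z)
    return e / e.sum()

def hard_selection_loss(scores, losses):
    # Classic RANSAC: keep only the best-scoring hypothesis (argmax).
    return losses[np.argmax(scores)]

def expected_loss(scores, losses, temperature=1.0):
    # Soft relaxation (assumed here): expected loss under a softmax over scores.
    return float(np.dot(softmax(scores, temperature), losses))

def expected_loss_grad(scores, losses, temperature=1.0):
    # Analytic gradient of the expected loss: dE/ds_i = p_i * (l_i - E) / T.
    p = softmax(scores, temperature)
    return p * (losses - np.dot(p, losses)) / temperature

# Hypothetical toy values: three pose hypotheses with scores and pose losses.
scores = np.array([2.0, 1.0, 0.5])
losses = np.array([0.1, 0.8, 1.5])

# Perturb the score of the second hypothesis slightly.
eps = 1e-4
bumped = scores + np.array([0.0, eps, 0.0])

# Hard selection does not react: the argmax is unchanged, so no learning signal.
hard_delta = hard_selection_loss(bumped, losses) - hard_selection_loss(scores, losses)

# Soft selection reacts smoothly, so the scores receive a usable gradient.
soft_delta = expected_loss(bumped, losses) - expected_loss(scores, losses)
grad = expected_loss_grad(scores, losses)
```

In this toy setting the hard-selection loss is piecewise constant in the scores, whereas the expected loss changes smoothly with them, which is what allows gradients to flow back to the network that predicts the correspondences.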