Object pose estimation is a fundamental computer vision task used in many robotics and augmented reality applications. Many established approaches predict 2D-3D keypoint correspondences using RANSAC (Random Sample Consensus) and then estimate the object pose with the PnP (Perspective-n-Point) algorithm. Since RANSAC is non-differentiable, correspondences cannot be learned directly in an end-to-end fashion. In this paper, we address the stereo image-based object pose estimation problem by i) introducing a differentiable RANSAC layer into a well-known monocular pose estimation network; ii) exploiting an uncertainty-driven multi-view PnP solver that fuses information from multiple views. We evaluate our approach on a challenging public stereo object pose estimation dataset and on a custom-built dataset we call the Transparent Tableware Dataset (TTD), achieving state-of-the-art results compared with other recent approaches. Furthermore, our ablation study shows that the differentiable RANSAC layer contributes significantly to the accuracy of the proposed method. Alongside this paper, we release the code of our method and the TTD dataset.
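To illustrate the core idea behind a differentiable RANSAC layer (not the paper's actual implementation), the sketch below replaces RANSAC's hard inlier threshold with a sigmoid-based soft inlier count on a toy 2D line-fitting problem, so hypothesis scores stay differentiable with respect to the residuals. All names and parameters (`tau`, `beta`, the softmax-weighted consensus) are illustrative assumptions.

```python
import numpy as np

def soft_inlier_score(residuals, tau=1.0, beta=5.0):
    # Soft inlier count: a sigmoid replaces the hard threshold of
    # classic RANSAC, so the score is differentiable in the residuals.
    # tau and beta (threshold and sharpness) are illustrative values.
    return 1.0 / (1.0 + np.exp(-beta * (tau - residuals)))

rng = np.random.default_rng(0)
# Synthetic 2D points on the line y = 2x + 1, plus gross outliers.
x = rng.uniform(-5, 5, 40)
y = 2 * x + 1 + rng.normal(0, 0.1, 40)
y[:8] += rng.uniform(5, 10, 8)  # inject 8 outliers

# Sample line hypotheses from random point pairs, as RANSAC does.
hypotheses = []
for _ in range(32):
    i, j = rng.choice(40, 2, replace=False)
    if abs(x[i] - x[j]) < 1e-6:
        continue
    a = (y[i] - y[j]) / (x[i] - x[j])
    b = y[i] - a * x[i]
    hypotheses.append((a, b))

# Score every hypothesis by its soft inlier count, then form a
# softmax-weighted consensus instead of a hard, non-differentiable argmax.
scores = np.array([soft_inlier_score(np.abs(y - (a * x + b))).sum()
                   for a, b in hypotheses])
w = np.exp(scores - scores.max())
w /= w.sum()
a_est = sum(wk * a for wk, (a, b) in zip(w, hypotheses))
b_est = sum(wk * b for wk, (a, b) in zip(w, hypotheses))
print(a_est, b_est)  # should be close to the true slope 2 and intercept 1
```

Because every step (residuals, sigmoid scoring, softmax weighting) is smooth, gradients can flow from the final pose estimate back to the predicted correspondences, which is what enables the end-to-end training the abstract describes.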