Object pose estimation is a fundamental computer vision task used in many robotics and augmented reality applications. Many established approaches rely on predicting 2D-3D keypoint correspondences using RANSAC (Random Sample Consensus) and estimating the object pose using the PnP (Perspective-n-Point) algorithm. Because RANSAC is non-differentiable, the correspondences cannot be learned directly in an end-to-end fashion. In this paper, we address the stereo image-based object pose estimation problem by i) introducing a differentiable RANSAC layer into a well-known monocular pose estimation network; and ii) exploiting an uncertainty-driven multi-view PnP solver that can fuse information from multiple views. We evaluate our approach on a challenging public stereo object pose estimation dataset and on a custom-built dataset that we call the Transparent Tableware Dataset (TTD), achieving state-of-the-art results compared with other recent approaches. Furthermore, our ablation study shows that the differentiable RANSAC layer plays a significant role in the accuracy of the proposed method. We release the code of our method and the TTD dataset with this paper.