Locating 3D objects from a single RGB image via Perspective-n-Point (PnP) is a long-standing problem in computer vision. Driven by end-to-end deep learning, recent studies suggest interpreting PnP as a differentiable layer, so that 2D-3D point correspondences can be partly learned by backpropagating the gradient of a pose loss. Yet learning the entire set of correspondences from scratch is highly challenging, particularly for ambiguous pose solutions, where the globally optimal pose is theoretically non-differentiable w.r.t. the points. In this paper, we propose EPro-PnP, a probabilistic PnP layer for general end-to-end pose estimation, which outputs a distribution of pose with differentiable probability density on the SE(3) manifold. The 2D-3D coordinates and corresponding weights are treated as intermediate variables learned by minimizing the KL divergence between the predicted and target pose distributions. The underlying principle generalizes previous approaches and resembles the attention mechanism. EPro-PnP can enhance existing correspondence networks, closing the gap between PnP-based methods and the task-specific leaders on the LineMOD 6DoF pose estimation benchmark. Furthermore, EPro-PnP helps to explore new possibilities in network design, as we demonstrate with a novel deformable correspondence network that achieves state-of-the-art pose accuracy on the nuScenes 3D object detection benchmark. Our code is available at https://github.com/tjiiv-cprg/EPro-PnP-v2.
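The idea of a pose distribution trained with a KL loss can be illustrated with a toy sketch. It is not the authors' implementation. It assumes the pose is restricted to a yaw angle chosen from a finite grid, so the density on SE(3) collapses to a softmax over candidate poses, and the target distribution is one-hot at the true pose. Under these assumptions the KL divergence equals the weighted reprojection cost at the true pose plus the log of the normalizer. All function names, the weight value, and the candidate grid are hypothetical.

```python
import math
import random

def project(point, yaw, translation):
    """Pinhole projection (unit focal length) of one 3D object point under a
    pose given by a rotation about the y-axis and a translation."""
    x, y, z = point
    c, s = math.cos(yaw), math.sin(yaw)
    xc = c * x + s * z + translation[0]
    yc = y + translation[1]
    zc = -s * x + c * z + translation[2]
    return xc / zc, yc / zc

def reprojection_cost(points_3d, points_2d, weights, yaw, translation):
    """Half the sum of squared, weighted reprojection errors for one pose."""
    cost = 0.0
    for p3, (u, v), (wu, wv) in zip(points_3d, points_2d, weights):
        pu, pv = project(p3, yaw, translation)
        cost += 0.5 * ((wu * (pu - u)) ** 2 + (wv * (pv - v)) ** 2)
    return cost

def pose_distribution(costs):
    """Softmax over negative costs: a discrete stand-in for the pose density."""
    shift = min(costs)  # subtract the minimum cost for numerical stability
    unnorm = [math.exp(-(c - shift)) for c in costs]
    total = sum(unnorm)
    return [u / total for u in unnorm]

def kl_loss(costs, gt_index):
    """KL divergence from a one-hot target to the predicted distribution:
    the cost at the target pose plus the log of the normalizer."""
    shift = min(costs)
    log_norm = -shift + math.log(sum(math.exp(-(c - shift)) for c in costs))
    return costs[gt_index] + log_norm

# Toy data (assumption): 8 random object points, uniform weights, fixed translation.
rng = random.Random(0)
points_3d = [(rng.uniform(-1, 1), rng.uniform(-1, 1), rng.uniform(-1, 1))
             for _ in range(8)]
weights = [(10.0, 10.0)] * 8
translation = (0.0, 0.0, 5.0)

# Candidate poses: a yaw grid; the true pose is one of the candidates.
candidate_yaws = [-1.0 + 0.1 * i for i in range(21)]
gt_index = 13
points_2d = [project(p, candidate_yaws[gt_index], translation) for p in points_3d]

costs = [reprojection_cost(points_3d, points_2d, weights, yaw, translation)
         for yaw in candidate_yaws]
probs = pose_distribution(costs)
loss = kl_loss(costs, gt_index)
```

In this discrete form the loss is the familiar softmax cross-entropy over poses, which is one way to read the abstract's comparison with the attention mechanism: the normalizer penalizes every pose that explains the correspondences well, not only the single best one.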