Perspective-1-Ellipsoid: Formulation, Analysis and Solutions of the Camera Pose Estimation Problem from One Ellipse-Ellipsoid Correspondence

In computer vision, camera pose estimation from correspondences between 3D geometric entities and their projections into the image has been a widely investigated problem. Although most state-of-the-art methods exploit low-level primitives such as points or lines, the emergence of very effective CNN-based object detectors in recent years has paved the way for the use of higher-level features carrying semantically meaningful information. Pioneering works in that direction have shown that modelling 3D objects by ellipsoids and 2D detections by ellipses offers a convenient way to link 2D and 3D data. However, the mathematical formalism most often used in the related literature does not make it easy to distinguish ellipsoids and ellipses from other quadrics and conics, leading to a loss of specificity that is potentially detrimental in some developments. Moreover, the linearization of the projection equation creates an over-representation of the camera parameters, which may also cause a loss of efficiency. In this paper, we therefore introduce an ellipsoid-specific theoretical framework and demonstrate its beneficial properties in the context of pose estimation. More precisely, we first show that the proposed formalism makes it possible to reduce the pose estimation problem to a position-only or orientation-only estimation problem in which the remaining unknowns can be derived in closed form. Then, we demonstrate that it can be further reduced to a 1 Degree-of-Freedom (1DoF) problem and provide the analytical derivations of the pose as a function of that unique scalar unknown. We illustrate our theoretical considerations with visual examples and include a discussion of the practical aspects. Finally, we release this paper along with the corresponding source code in order to contribute towards more efficient resolutions of ellipsoid-related pose estimation problems.
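To make the ellipse-ellipsoid correspondence concrete, the following minimal sketch projects an ellipsoid into an image using the standard dual-quadric relation from multiple-view geometry, C* = P Q* Pᵀ. We assume here that this is the kind of general quadric/conic formalism the abstract contrasts with; it is not the ellipsoid-specific framework introduced in the paper, and it is not taken from the released source code. All function names (`ellipsoid_dual_quadric`, `project_to_dual_conic`, `ellipse_from_dual_conic`) are hypothetical, and the sketch assumes the camera centre lies outside the ellipsoid with the ellipsoid in front of the camera, so that the outline is a proper ellipse.

```python
import numpy as np

def ellipsoid_dual_quadric(center, semi_axes, R=None):
    """4x4 dual quadric Q* of an ellipsoid (hypothetical helper).

    In the ellipsoid's own frame Q* = diag(a^2, b^2, c^2, -1); moving it
    with the rigid transform T = [R | center] gives T Q* T^T.
    """
    R = np.eye(3) if R is None else np.asarray(R, dtype=float)
    axes = np.asarray(semi_axes, dtype=float)
    Q_local = np.diag(np.append(axes ** 2, -1.0))
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = np.asarray(center, dtype=float)
    return T @ Q_local @ T.T

def project_to_dual_conic(P, Q_dual):
    """Dual conic of the ellipsoid outline: C* = P Q* P^T (3x3, up to scale)."""
    return P @ Q_dual @ P.T

def ellipse_from_dual_conic(C_dual):
    """Recover ellipse centre and semi-axes (largest first) from a dual conic.

    Assumes C_dual really describes an ellipse (see the lead-in).
    """
    C = -C_dual / C_dual[2, 2]               # fix the scale so that C[2, 2] = -1
    center = -C[:2, 2]                       # last column then equals -centre
    M = C[:2, :2] + np.outer(center, center) # equals R2 diag(a^2, b^2) R2^T
    eigvals = np.linalg.eigvalsh(M)          # ascending order
    return center, np.sqrt(eigvals[::-1])

if __name__ == "__main__":
    # Illustrative pinhole camera at the origin looking down +z (assumed values).
    K = np.array([[500.0, 0.0, 320.0],
                  [0.0, 500.0, 240.0],
                  [0.0, 0.0, 1.0]])
    P = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
    # Sphere of radius 3 centred at depth 5 on the optical axis.
    Q_dual = ellipsoid_dual_quadric([0.0, 0.0, 5.0], [3.0, 3.0, 3.0])
    center, semi_axes = ellipse_from_dual_conic(project_to_dual_conic(P, Q_dual))
    print("ellipse centre:", center)
    print("ellipse semi-axes:", semi_axes)
```

In the example, the sphere projects to a circle centred on the principal point whose radius is f · r / sqrt(d² − r²) = 500 · 3/4 = 375 pixels, which gives a simple analytical check. Note that nothing in the 4x4 and 3x3 matrices themselves guarantees that they represent an ellipsoid and an ellipse rather than another quadric or conic, which illustrates the loss of specificity mentioned above.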
