Given 2D point correspondences between an image pair, inferring the camera motion is a fundamental problem in computer vision. Existing works generally start from the epipolar constraint and estimate the essential matrix, which is not optimal in the maximum likelihood (ML) sense. In this paper, we return to the original measurement model with respect to the rotation matrix and normalized translation vector and formulate the ML problem. We then propose a two-step algorithm to solve it. In the first step, we estimate the variance of the measurement noise and devise a consistent estimator based on bias elimination. In the second step, we execute a single Gauss-Newton iteration on the manifold to refine the consistent estimate. We prove that the proposed estimate has the same asymptotic statistical properties as the ML estimate. The first is consistency, i.e., the estimate converges to the ground truth as the number of points increases. The second is asymptotic efficiency, i.e., the mean squared error of the estimate converges to the theoretical lower bound, the Cramér-Rao bound. In addition, we show that our algorithm has linear time complexity. These properties give our estimator a clear advantage when point correspondences are dense. Experiments on both synthetic data and real images demonstrate that when the number of points reaches the order of hundreds, our estimator outperforms state-of-the-art ones in terms of estimation accuracy and CPU time.
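For context, the epipolar-constraint baseline that the abstract contrasts against can be sketched as follows. This is not the proposed two-step estimator: it is a minimal linear (eight-point style) estimate of the essential matrix from the constraint x2ᵀ E x1 = 0 with E = [t]ₓ R, under the assumption that the points are given in normalized (calibrated) homogeneous coordinates. The function names `skew` and `essential_linear` and the synthetic scene are hypothetical and chosen only for illustration.

```python
import numpy as np

def skew(t):
    """Cross-product matrix [t]_x, so that skew(t) @ v == np.cross(t, v)."""
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])

def essential_linear(x1, x2):
    """Linear essential-matrix estimate from n >= 8 correspondences.

    x1, x2: (n, 3) arrays of normalized homogeneous image points.
    Returns E (singular values 1, 1, 0) with x2[i] @ E @ x1[i] ~ 0.
    """
    # Each correspondence yields one linear equation in the 9 entries of E
    # (row-major order): sum_ij x2_i * E_ij * x1_j = 0.
    A = np.einsum("ni,nj->nij", x2, x1).reshape(-1, 9)
    # The least-squares solution is the right singular vector of A
    # associated with the smallest singular value.
    _, _, Vt = np.linalg.svd(A)
    E = Vt[-1].reshape(3, 3)
    # Project onto the set of essential matrices: two equal singular
    # values and one zero singular value.
    U, _, Vt = np.linalg.svd(E)
    return U @ np.diag([1.0, 1.0, 0.0]) @ Vt

# Synthetic, noise-free check (illustrative scene, not data from the paper).
rng = np.random.default_rng(0)
n = 50
X1 = np.column_stack([rng.uniform(-1, 1, n),
                      rng.uniform(-1, 1, n),
                      rng.uniform(4, 8, n)])      # 3D points in camera 1
c, s = np.cos(0.1), np.sin(0.1)
R = np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])  # rotation about y
t = np.array([1.0, 0.2, 0.1])
t /= np.linalg.norm(t)                            # normalized translation
X2 = X1 @ R.T + t                                 # same points in camera 2
x1 = X1 / X1[:, 2:3]                              # normalized image points
x2 = X2 / X2[:, 2:3]

E_true = skew(t) @ R
E_est = essential_linear(x1, x2)
# E is only defined up to sign, so compare against both +E_true and -E_true.
err = min(np.linalg.norm(E_est - E_true), np.linalg.norm(E_est + E_true))
print("Frobenius error up to sign:", err)
```

With noisy measurements this linear estimate minimizes an algebraic residual rather than the likelihood of the measurement model, which is the gap that motivates formulating the ML problem directly in terms of the rotation and normalized translation.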