Camera localization methods based on retrieval, local feature matching, and 3D structure-based pose estimation are accurate, but they require substantial storage, are slow, and are not privacy-preserving. A method based on scene landmark detection (SLD) was recently proposed to address these limitations. It trains a convolutional neural network (CNN) to detect a few predetermined, salient, scene-specific 3D points, or landmarks, and computes the camera pose from the associated 2D-3D correspondences. Although SLD outperformed existing learning-based approaches, it was notably less accurate than 3D structure-based methods. In this paper, we show that the accuracy gap was due to insufficient model capacity and noisy labels during training. To mitigate the capacity issue, we propose splitting the landmarks into subgroups and training a separate network for each subgroup. To generate better training labels, we propose using dense reconstructions to estimate the visibility of scene landmarks. Finally, we present a compact architecture that improves memory efficiency. In terms of accuracy, our approach is on par with state-of-the-art structure-based methods on the INDOOR-6 dataset, but it runs significantly faster and uses less storage. Code and models can be found at https://github.com/microsoft/SceneLandmarkLocalization.
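As an illustration of the subgroup idea, the following minimal sketch partitions landmark IDs into disjoint subgroups and merges the detections of one detector per subgroup into a single list of 2D-3D correspondences. It is not the released implementation: the round-robin assignment, the function names `split_into_subgroups` and `gather_correspondences`, the detector interface, and the `min_confidence` threshold are all assumptions made for this example.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Point2D = Tuple[float, float]
Point3D = Tuple[float, float, float]
# Hypothetical detector interface: image -> {landmark_id: (u, v, confidence)}.
Detector = Callable[[object], Dict[int, Tuple[float, float, float]]]


def split_into_subgroups(landmark_ids: Sequence[int], num_groups: int) -> List[List[int]]:
    """Partition landmark IDs into disjoint subgroups.

    Round-robin assignment is an assumption of this sketch; the source
    only states that landmarks are split into subgroups.
    """
    if num_groups < 1:
        raise ValueError("num_groups must be at least 1")
    groups: List[List[int]] = [[] for _ in range(num_groups)]
    for index, landmark_id in enumerate(landmark_ids):
        groups[index % num_groups].append(landmark_id)
    return groups


def gather_correspondences(
    image: object,
    detectors: Sequence[Detector],
    landmarks_3d: Dict[int, Point3D],
    min_confidence: float = 0.5,  # hypothetical threshold
) -> List[Tuple[Point2D, Point3D]]:
    """Run one detector per subgroup and merge confident detections
    into a single list of (2D point, 3D point) pairs."""
    correspondences: List[Tuple[Point2D, Point3D]] = []
    for detector in detectors:
        for landmark_id, (u, v, confidence) in detector(image).items():
            if confidence >= min_confidence:
                correspondences.append(((u, v), landmarks_3d[landmark_id]))
    return correspondences
```

The merged list has the form expected by a standard perspective-n-point solver (for example, OpenCV's `cv2.solvePnPRansac`), which would then estimate the camera pose.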