In this work, we propose using Neural Radiance Fields (NeRF) as a scene representation for visual localization. NeRF has recently been used to enhance pose regression and scene coordinate regression models by augmenting the training database, providing auxiliary supervision through rendered images, or serving as an iterative refinement module. We build on its recognized advantages, namely a compact scene representation with realistic appearance and accurate geometry, by exploring how well NeRF's internal features establish precise 2D-3D matches for localization. To this end, we comprehensively examine the implicit knowledge that NeRF acquires through view synthesis and how it supports matching under various conditions, including different matching network architectures, encoder features extracted at multiple layers, and varying training configurations. We further introduce NeRFMatch, an advanced 2D-3D matching function that capitalizes on this internal knowledge. Evaluated on standard localization benchmarks within a structure-based pipeline, NeRFMatch sets a new state of the art for localization performance on Cambridge Landmarks.
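To make the idea of 2D-3D matching concrete, the following is a minimal sketch of mutual nearest-neighbor matching between per-pixel image descriptors and per-point scene descriptors. It is an illustrative baseline under stated assumptions, not the NeRFMatch architecture: the function name `mutual_nn_matches`, the cosine-similarity score, and the `min_score` threshold are all hypothetical choices introduced here for illustration.

```python
import numpy as np


def mutual_nn_matches(feat_2d, feat_3d, min_score=0.0):
    """Match image features to 3D-point features by mutual nearest neighbor.

    feat_2d: (N, D) array of image descriptors (hypothetical input).
    feat_3d: (M, D) array of descriptors attached to 3D scene points
             (hypothetical input; in the paper's setting these would come
             from NeRF's internal features).
    Returns a (K, 2) integer array of (2D index, 3D index) pairs.
    """
    # L2-normalize so the dot product is cosine similarity.
    a = feat_2d / np.linalg.norm(feat_2d, axis=1, keepdims=True)
    b = feat_3d / np.linalg.norm(feat_3d, axis=1, keepdims=True)
    sim = a @ b.T  # shape (N, M)

    best_3d = sim.argmax(axis=1)  # best 3D point for each 2D feature
    best_2d = sim.argmax(axis=0)  # best 2D feature for each 3D point

    idx_2d = np.arange(sim.shape[0])
    # Keep a pair only if each side picks the other (mutual check)
    # and the similarity clears the assumed threshold.
    mutual = best_2d[best_3d] == idx_2d
    strong = sim[idx_2d, best_3d] >= min_score
    keep = mutual & strong
    return np.stack([idx_2d[keep], best_3d[keep]], axis=1)
```

In a structure-based pipeline, correspondences of this kind are typically passed to a PnP solver inside a RANSAC loop to estimate the camera pose; that step is omitted here.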