We propose a novel visual re-localization method that uses a transformer to directly match implicit 3D descriptors against a 2D image. Our pipeline represents the 3D scene with a conditional neural radiance field (NeRF), which supports both continuous 3D descriptor generation and neural rendering. By unifying feature matching and scene coordinate regression in a single framework, our model learns generalizable knowledge and a scene prior, respectively, over two training stages. Furthermore, to improve localization robustness when a domain gap exists between training and testing, we propose an appearance adaptation layer that explicitly aligns the style of the 3D model with that of the query image. Experiments show that our method achieves higher localization accuracy than other learning-based approaches on multiple benchmarks. Code is available at \url{https://github.com/JenningsL/nerf-loc}.