This paper addresses training a robust RGB-D registration model without ground-truth pose supervision. Existing methods usually adopt a pairwise training strategy based on differentiable rendering, which uses the photometric and geometric consistency between the two registered frames as supervision. However, this frame-to-frame framework suffers from poor multi-view consistency due to factors such as lighting changes, geometric occlusion, and reflective materials. We present NeRF-UR, a novel frame-to-model optimization framework for unsupervised RGB-D registration. Instead of enforcing frame-to-frame consistency, we use a neural radiance field (NeRF) as a global model of the scene and optimize poses using the consistency between the input frames and the frames re-rendered by the NeRF. This design can significantly improve robustness in scenarios with poor multi-view consistency and provides a better learning signal for the registration model. Furthermore, to bootstrap the NeRF optimization, we create a synthetic dataset, Sim-RGBD, with a photo-realistic simulator to warm up the registration model. By first training the registration model on Sim-RGBD and then fine-tuning it on real data without supervision, our framework distills feature extraction and registration capability from simulation to reality. Our method outperforms state-of-the-art counterparts on two popular indoor RGB-D datasets, ScanNet and 3DMatch. Code and models will be released for reproducibility.
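To make the frame-to-model idea concrete, the following is a minimal sketch of a consistency term between an input RGB-D frame and a frame re-rendered from a scene model. The abstract does not specify the loss, so everything here is an assumption made for illustration: the function name `frame_to_model_loss`, the use of mean absolute error for both terms, the validity mask on positive depth, and the weight `w_geo` are all hypothetical and may differ from what NeRF-UR actually uses.

```python
import numpy as np


def frame_to_model_loss(rgb_in, depth_in, rgb_render, depth_render, w_geo=1.0):
    """Illustrative photometric + geometric consistency (hypothetical form).

    rgb_in, rgb_render:     float arrays of shape (H, W, 3), values in [0, 1]
    depth_in, depth_render: float arrays of shape (H, W), 0 marks missing depth
    w_geo:                  assumed weight balancing the geometric term
    """
    # Only compare pixels where both the sensor and the render have depth.
    valid = (depth_in > 0) & (depth_render > 0)
    if not valid.any():
        return 0.0

    # Photometric term: mean absolute colour difference over valid pixels.
    photo = np.abs(rgb_in - rgb_render).mean(axis=-1)[valid].mean()

    # Geometric term: mean absolute depth difference over valid pixels.
    geo = np.abs(depth_in - depth_render)[valid].mean()

    return float(photo + w_geo * geo)
```

In a pipeline of the kind described above, a term like this would be evaluated on frames rendered at the currently estimated pose, so that lowering it adjusts the pose toward agreement with the global scene model rather than with a single neighbouring frame.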