We present a learning framework for reconstructing neural scene representations from a small number of unconstrained tourist photos. Because such in-the-wild photographs contain transient occluders, constructing radiance fields from them requires decomposing each image into static and transient components, and existing methods need a large amount of training data to do so. We introduce SF-NeRF, which aims to disentangle these two components from only a few images by exploiting semantic information without any supervision. The proposed method contains an occlusion filtering module that predicts a transient color and its opacity for each pixel, which enables the NeRF model to learn only the static scene representation. This filtering module learns the transient phenomena guided by pixel-wise semantic features from an image encoder, which can be trained across multiple scenes to learn a prior over transient objects. Furthermore, we present two techniques to prevent ambiguous decomposition and noisy results of the filtering module. We demonstrate that our method outperforms state-of-the-art novel view synthesis methods on the Phototourism dataset in a few-shot setting.
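The abstract does not state how the predicted transient color and opacity are combined with the NeRF rendering. One plausible reading, assumed here purely for illustration, is per-pixel alpha blending of the transient color over the static color, with a photometric loss on the blended result. The function names `composite_pixels` and `photometric_loss` are hypothetical and are not taken from SF-NeRF.

```python
import numpy as np


def composite_pixels(static_rgb, transient_rgb, transient_alpha):
    """Blend a transient color over a static color for each pixel.

    Assumption (not from the paper): standard alpha compositing,
    composite = alpha * transient + (1 - alpha) * static.
    Shapes: static_rgb and transient_rgb are (N, 3); transient_alpha is (N,).
    """
    static_rgb = np.asarray(static_rgb, dtype=np.float64)
    transient_rgb = np.asarray(transient_rgb, dtype=np.float64)
    # Clamp opacity to [0, 1] and add a trailing axis so it broadcasts over RGB.
    alpha = np.clip(np.asarray(transient_alpha, dtype=np.float64), 0.0, 1.0)[..., None]
    return alpha * transient_rgb + (1.0 - alpha) * static_rgb


def photometric_loss(static_rgb, transient_rgb, transient_alpha, observed_rgb):
    """Mean squared error between the blended prediction and the observed pixels."""
    composite = composite_pixels(static_rgb, transient_rgb, transient_alpha)
    observed_rgb = np.asarray(observed_rgb, dtype=np.float64)
    return float(np.mean((composite - observed_rgb) ** 2))


if __name__ == "__main__":
    static = np.array([[0.2, 0.4, 0.6], [0.9, 0.9, 0.9]])     # static scene color
    transient = np.array([[1.0, 0.0, 0.0], [0.0, 0.0, 0.0]])  # occluder color
    alpha = np.array([0.0, 1.0])  # pixel 0 unoccluded, pixel 1 fully occluded
    print(composite_pixels(static, transient, alpha))
```

Under this assumption, a pixel with opacity 0 is explained entirely by the static scene, and a pixel with opacity 1 is explained entirely by the occluder, so a loss on the blended color does not push the static model to reproduce it.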