Inferring scene geometry from images via Structure from Motion is a long-standing and fundamental problem in computer vision. While classical approaches and, more recently, depth map predictions only focus on the visible parts of a scene, the task of scene completion aims to reason about geometry even in occluded regions. With the popularity of neural radiance fields (NeRFs), implicit representations also became popular for scene completion by predicting so-called density fields. Unlike explicit approaches, e.g., voxel-based methods, density fields also allow for accurate depth prediction and novel-view synthesis via image-based rendering. In this work, we propose to fuse the scene reconstruction from multiple images and distill this knowledge into a more accurate single-view scene reconstruction. To this end, we propose Multi-View Behind the Scenes (MVBTS) to fuse density fields from multiple posed images, trained in a fully self-supervised manner from image data alone. Via knowledge distillation, MVBTS then directly supervises a single-view scene completion network called KDBTS. It achieves state-of-the-art performance on occupancy prediction, especially in occluded regions.
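To make the density-field idea concrete: a density field assigns a volume density to every 3D point, and depth can be rendered along a camera ray by standard NeRF-style alpha compositing. The sketch below is illustrative only (it is not the MVBTS/KDBTS implementation); the function name and the piecewise-constant sampling scheme are assumptions for the example.

```python
import numpy as np

def render_depth(sigmas, ts):
    """Render expected depth along one ray from sampled densities.

    sigmas: density sigma_i at each of the N sample points (shape [N])
    ts:     distance of each sample point from the camera (shape [N])

    Uses the discrete volume-rendering quadrature: each interval gets
    opacity alpha_i = 1 - exp(-sigma_i * delta_i), and the expected
    depth is the transmittance-weighted sum of sample distances.
    """
    deltas = ts[1:] - ts[:-1]          # interval lengths between samples
    sigmas = sigmas[:-1]               # one density per interval
    alphas = 1.0 - np.exp(-sigmas * deltas)
    # transmittance: probability the ray reaches interval i unoccluded
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alphas]))[:-1]
    weights = trans * alphas           # per-interval contribution
    return np.sum(weights * ts[:-1])

# A single dense slab around t = 2.0 should yield depth ~2.0.
ts = np.linspace(0.0, 4.0, 41)
sigmas = np.zeros(41)
sigmas[20] = 1000.0                    # near-opaque interval at t = 2.0
depth = render_depth(sigmas, ts)
```

The same weights can composite colors for novel-view synthesis, which is why density fields support both depth prediction and image-based rendering from one representation.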