Inferring a meaningful geometric scene representation from a single image is a fundamental problem in computer vision. Approaches based on traditional depth map prediction can only reason about areas that are visible in the image. Neural radiance fields (NeRFs) can capture true 3D structure including color, but are too complex to be generated from a single image. As an alternative, we propose to predict implicit density fields. A density field maps every location in the frustum of the input image to a volumetric density. Because color is sampled directly from the available views instead of being stored in the density field, our scene representation is significantly less complex than a NeRF, and a neural network can predict it in a single forward pass. The prediction network is trained through self-supervision from video data alone. Our formulation allows volume rendering to perform both depth prediction and novel view synthesis. Through experiments, we show that our method is able to predict meaningful geometry for regions that are occluded in the input image. Additionally, we demonstrate the potential of our approach on three datasets for depth prediction and novel view synthesis.
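To make concrete how volume rendering over a density field can yield both depth and color, the following is a minimal sketch of the standard NeRF-style quadrature along a single ray. The abstract does not spell out the rendering equations, so the discretization, the function name `render_ray`, and the handling of the last sample are assumptions made for illustration. The per-sample colors are taken as given inputs here, standing in for colors sampled from other views.

```python
import math


def render_ray(densities, distances, colors):
    """Composite per-sample densities along one ray into depth and color.

    densities: non-negative volumetric density at each sample
    distances: increasing sample distances from the camera
    colors:    per-sample RGB tuples (assumed to be sampled from other views)
    """
    transmittance = 1.0  # fraction of light that reaches the current sample
    depth = 0.0
    color = [0.0, 0.0, 0.0]
    weight_sum = 0.0
    for i, sigma in enumerate(densities):
        # Segment length to the next sample; the last sample reuses the
        # previous spacing (an assumption of this sketch).
        if i + 1 < len(distances):
            delta = distances[i + 1] - distances[i]
        elif i > 0:
            delta = distances[i] - distances[i - 1]
        else:
            delta = 1.0
        alpha = 1.0 - math.exp(-sigma * delta)  # opacity of this segment
        weight = transmittance * alpha          # chance the ray stops here
        depth += weight * distances[i]
        for k in range(3):
            color[k] += weight * colors[i][k]
        weight_sum += weight
        transmittance *= 1.0 - alpha
    return depth, tuple(color), weight_sum
```

With a ray that passes through empty space and then hits a very dense sample at distance 3, almost all of the weight falls on that sample, so the rendered depth is close to 3 and the rendered color is close to that sample's color. Samples behind it receive near-zero weight, which is why the density values there do not influence the rendering from this viewpoint.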