Inferring a meaningful geometric scene representation from a single image is a fundamental problem in computer vision. Approaches based on traditional depth map prediction can only reason about areas that are visible in the image. Neural radiance fields (NeRFs) can capture full 3D geometry together with color, but they are currently too complex to be generated from a single image. As an alternative, we introduce a neural network that predicts an implicit density field from a single image. It maps every location in the frustum of the image to a volumetric density. Our network can be trained through self-supervision from video data alone. Because we do not store color in the implicit volume but instead sample it directly from the available views during training, our scene representation is significantly less complex than a NeRF, and we can train neural networks to predict it. We can then apply volume rendering to perform both depth prediction and novel view synthesis. In our experiments, we show that our method is able to predict meaningful geometry for regions that are occluded in the input image. Additionally, we demonstrate the potential of our approach on three datasets for depth prediction and novel view synthesis.
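To make the volume rendering step concrete, the following is a minimal sketch of how densities sampled along a single camera ray could be turned into an expected depth and a composited color. It uses the standard discrete volume rendering quadrature and is not taken from the authors' implementation. The function names `render_weights`, `render_depth`, and `render_color`, the padding of the last sample interval, and the normalization of the depth by the accumulated weight are all assumptions made for illustration. The per-sample colors are assumed to have been looked up already in another view, since the density field itself stores no color.

```python
import numpy as np


def render_weights(sigma, z_vals):
    """Per-sample compositing weights for one ray.

    sigma:  (N,) non-negative densities at the sample points
    z_vals: (N,) strictly increasing sample distances from the camera (N >= 2)
    """
    sigma = np.asarray(sigma, dtype=float)
    z_vals = np.asarray(z_vals, dtype=float)
    # Spacing between consecutive samples. The last interval reuses the
    # previous spacing (an assumption; other padding choices are possible).
    last = z_vals[-1] + (z_vals[-1] - z_vals[-2])
    deltas = np.diff(z_vals, append=last)
    # Probability that the ray terminates inside each interval.
    alpha = 1.0 - np.exp(-sigma * deltas)
    # Transmittance: probability that the ray reaches sample i unoccluded.
    trans = np.concatenate(([1.0], np.cumprod(1.0 - alpha)[:-1]))
    return trans * alpha


def render_depth(sigma, z_vals, eps=1e-8):
    """Expected ray termination depth, normalized by the accumulated weight."""
    weights = render_weights(sigma, z_vals)
    z_vals = np.asarray(z_vals, dtype=float)
    return float(np.sum(weights * z_vals) / max(weights.sum(), eps))


def render_color(sigma, z_vals, colors):
    """Composite colors sampled from another view at the ray's sample points.

    colors: (N, 3) colors, assumed to be looked up by projecting each
            3D sample point into a second image (not shown here).
    """
    weights = render_weights(sigma, z_vals)
    return np.sum(weights[:, None] * np.asarray(colors, dtype=float), axis=0)
```

For example, a ray whose samples have zero density everywhere except for one very dense sample at distance 3 yields a depth of about 3 and takes the color associated with that sample. Because all weights depend only on density, the same weights serve both depth prediction and novel view synthesis.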