Estimating depth from a sequence of posed RGB images is a fundamental computer vision task, with applications in augmented reality, path planning, and beyond. Prior work typically draws on previous frames in a multi-view stereo framework, relying on matching textures in a local neighborhood. In contrast, our model leverages historical predictions by feeding the latest 3D geometry as an extra input to the network. This self-generated geometric hint can encode information from areas of the scene not covered by the keyframes, and it is more regularized than the individually predicted depth maps of previous frames. We introduce a Hint MLP that combines cost-volume features with a hint of the prior geometry, rendered as a depth map from the current camera location, together with a measure of confidence in that prior geometry. We demonstrate that our method, which runs at interactive speeds, achieves state-of-the-art depth estimation and 3D scene reconstruction in both offline and incremental evaluation scenarios.
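To make the Hint MLP idea concrete, the sketch below shows one plausible reading of it as a small per-pixel MLP: cost-volume features are concatenated with the rendered hint depth and its confidence, then passed through two linear layers. All dimensions, layer sizes, and the function name `hint_mlp` are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def hint_mlp(cost_feats, hint_depth, hint_conf, w1, b1, w2, b2):
    """Hypothetical per-pixel Hint MLP: fuse cost-volume features with the
    prior geometry rendered as a depth map plus a confidence channel."""
    # cost_feats: (H, W, C); hint_depth, hint_conf: (H, W, 1)
    x = np.concatenate([cost_feats, hint_depth, hint_conf], axis=-1)
    h = np.maximum(x @ w1 + b1, 0.0)   # hidden layer with ReLU
    return h @ w2 + b2                 # fused per-pixel features

# Toy sizes, purely for illustration
H, W, C, hidden = 4, 5, 8, 16
cost = rng.standard_normal((H, W, C))
depth = rng.standard_normal((H, W, 1))   # rendered hint depth map
conf = rng.random((H, W, 1))             # confidence in the prior geometry
w1 = rng.standard_normal((C + 2, hidden)); b1 = np.zeros(hidden)
w2 = rng.standard_normal((hidden, C));     b2 = np.zeros(C)

out = hint_mlp(cost, depth, conf, w1, b1, w2, b2)
print(out.shape)  # (4, 5, 8)
```

Because the MLP acts independently at each pixel, the hint can be consulted only where the rendered prior geometry is available and confident, which matches the abstract's framing of the hint as an optional extra input rather than a hard constraint.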