Monocular depth estimation from a single image is an ill-posed problem in computer vision because a single view offers too few reliable cues to serve as prior knowledge. Beyond inter-frame supervision from stereo pairs and adjacent frames, extensive prior information is available within a single frame. Reflections from specular surfaces are informative intra-frame priors that enable us to reformulate the ill-posed depth estimation task as multi-view synthesis. This paper proposes the first self-supervised approach to deep-learning depth estimation on water scenes via intra-frame priors, namely reflection supervision and geometric constraints. In the first stage, a water segmentation network is applied to separate the reflection components from the entire image. Next, we construct a self-supervised framework that predicts the target appearance from the reflections, which are treated as views from other perspectives. The photometric re-projection error, which combines SmoothL1 and a novel photometric adaptive SSIM, is formulated to optimize pose and depth estimation by aligning the transformed virtual depths with the source ones. As a supplement, the water surface is determined from the real and virtual camera positions, which complements the depth of the water area. Furthermore, to alleviate laborious ground-truth annotation, we introduce a large-scale water reflection scene (WRS) dataset rendered with Unreal Engine 4. Extensive experiments on the WRS dataset demonstrate the feasibility of the proposed method compared with state-of-the-art depth estimation techniques.
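To make the photometric re-projection error more concrete, the following is a minimal sketch of a loss that mixes a SmoothL1 term with an SSIM term. It is an illustration under stated assumptions, not the method described above: the abstract does not specify the photometric adaptive SSIM, so plain SSIM with a 3x3 box window is swapped in, and the function names, the fixed weight `alpha=0.85`, and the optional `mask` (standing in for a reflection or water mask) are all hypothetical.

```python
import numpy as np

def box_filter(img, k=3):
    """Mean over every k x k window of a 2-D array (valid region only)."""
    windows = np.lib.stride_tricks.sliding_window_view(img, (k, k))
    return windows.mean(axis=(-2, -1))

def ssim_map(x, y, k=3, c1=0.01 ** 2, c2=0.03 ** 2):
    """Standard SSIM per window; NOT the paper's adaptive variant."""
    mu_x, mu_y = box_filter(x, k), box_filter(y, k)
    var_x = box_filter(x * x, k) - mu_x ** 2
    var_y = box_filter(y * y, k) - mu_y ** 2
    cov = box_filter(x * y, k) - mu_x * mu_y
    num = (2 * mu_x * mu_y + c1) * (2 * cov + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return num / den

def smooth_l1(x, y, beta=1.0):
    """Element-wise SmoothL1 (Huber-style) difference."""
    d = np.abs(x - y)
    return np.where(d < beta, 0.5 * d ** 2 / beta, d - 0.5 * beta)

def photometric_loss(target, synthesized, mask=None, alpha=0.85, k=3):
    """Weighted SSIM + SmoothL1 error between two grayscale images in [0, 1].

    alpha is an assumed fixed weight; k must be odd and at least 3.
    """
    ssim_term = np.clip((1.0 - ssim_map(target, synthesized, k)) / 2.0, 0.0, 1.0)
    pad = k // 2  # crop the SmoothL1 map to the valid SSIM region
    l1_term = smooth_l1(target, synthesized)[pad:-pad, pad:-pad]
    per_pixel = alpha * ssim_term + (1.0 - alpha) * l1_term
    if mask is not None:
        m = mask[pad:-pad, pad:-pad].astype(float)
        return float((per_pixel * m).sum() / max(m.sum(), 1.0))
    return float(per_pixel.mean())
```

In a training loop, `synthesized` would be the target view re-projected from the reflection using the predicted depth and pose, so minimizing this value pushes both predictions toward a consistent geometry.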
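The geometric constraint on the water surface can also be sketched. The abstract states only that the surface is determined from the real and virtual camera positions; the code below assumes ideal mirror geometry, in which the virtual camera is the reflection of the real one and the water plane is therefore the perpendicular bisector of the segment joining the two camera centers. The function names are hypothetical, and the value returned by `ray_plane_distance` is a distance along the viewing ray rather than a z-depth.

```python
import numpy as np

def water_plane(c_real, c_virtual):
    """Return (n, d) of the plane n.x + d = 0 that mirrors c_real onto c_virtual.

    Assumes the water surface is a perfect planar mirror.
    """
    c_real = np.asarray(c_real, dtype=float)
    c_virtual = np.asarray(c_virtual, dtype=float)
    diff = c_virtual - c_real
    dist = np.linalg.norm(diff)
    if dist == 0.0:
        raise ValueError("real and virtual camera centers coincide")
    n = diff / dist                       # unit normal, pointing toward the virtual camera
    midpoint = 0.5 * (c_real + c_virtual)  # a point that lies on the plane
    d = -float(n @ midpoint)
    return n, d

def ray_plane_distance(origin, direction, n, d):
    """Distance along a ray to the plane, or None if there is no forward hit."""
    origin = np.asarray(origin, dtype=float)
    direction = np.asarray(direction, dtype=float)
    denom = float(n @ direction)
    if abs(denom) < 1e-12:                # ray parallel to the water surface
        return None
    t = -(float(n @ origin) + d) / denom
    return t if t > 0 else None
```

For example, with the real camera at `(0, 2, 0)` and the virtual camera at `(0, -2, 0)`, the recovered plane is `y = 0`, and a ray cast from the real camera through a water pixel can be intersected with that plane to fill in a distance for the water region.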