Photometric differences are widely used as supervision signals to train neural networks that estimate depth and camera pose from unlabeled monocular videos. However, this signal harms model optimization wherever occlusions and moving objects violate the underlying static-scene assumption. In addition, pixels in textureless regions and other weakly discriminative pixels hinder training. To address these problems, we first handle moving objects and occlusions using the differences in the flow fields and in the depth structure generated by affine transformation and view synthesis, respectively. Second, we mitigate the effect of textureless regions on model optimization by measuring differences between features that carry more semantic and contextual information, without adding networks. Finally, although each sub-objective function includes a bidirectional component, inference is performed only once per image pair, which helps reduce overhead. Extensive experiments and visual analysis demonstrate the effectiveness of the proposed method, which outperforms existing state-of-the-art self-supervised methods under the same conditions and without introducing additional auxiliary information.
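To make the role of occlusions and moving objects concrete, the following is a minimal sketch of a masked photometric difference. It is not the method proposed here: the function name `masked_photometric_loss`, the grayscale list-of-rows image layout, and the idea of a precomputed boolean validity mask are all assumptions made for illustration. It only shows how pixels that break the static-scene assumption can be excluded from a mean absolute intensity difference between a target frame and a synthesized view.

```python
def masked_photometric_loss(target, synthesized, valid_mask):
    """Mean absolute intensity difference over pixels marked valid.

    target, synthesized: grayscale images as lists of rows of floats
    (a hypothetical layout chosen to keep the sketch dependency-free).
    valid_mask: same layout with booleans; False marks pixels to ignore,
    e.g. pixels judged to be occluded or to lie on a moving object.
    """
    total = 0.0
    n_valid = 0
    for row_t, row_s, row_m in zip(target, synthesized, valid_mask):
        for t, s, m in zip(row_t, row_s, row_m):
            if m:
                total += abs(t - s)
                n_valid += 1
    # With no valid pixels there is nothing to supervise, so return zero.
    return total / n_valid if n_valid else 0.0


# Toy 2x2 example: the synthesized view disagrees with the target at one pixel.
target = [[0.0, 0.0], [0.0, 0.0]]
synthesized = [[1.0, 0.0], [0.0, 0.0]]

mask_all = [[True, True], [True, True]]
mask_occluded = [[False, True], [True, True]]  # treat the top-left pixel as occluded

loss_unmasked = masked_photometric_loss(target, synthesized, mask_all)
loss_masked = masked_photometric_loss(target, synthesized, mask_occluded)
```

In this toy case the unmasked loss is 0.25, driven entirely by the single inconsistent pixel, while the masked loss is 0.0. In a real pipeline the mask would have to be estimated, which is the part that the flow-field and depth-structure differences described above are meant to address.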