Unsupervised methods have shown promising results on monocular depth estimation. However, their training data must be captured in scenes without moving objects, and recent methods tend to increase the number of model parameters to push accuracy further. In this paper, an unsupervised learning framework is proposed to jointly predict monocular depth and complete 3D motion, including the motions of both moving objects and the camera. (1) Recurrent modulation units are used to adaptively and iteratively fuse encoder and decoder features. This improves single-image depth inference without adding many model parameters. (2) Instead of using a single set of filters for upsampling, multiple sets of filters are devised for residual upsampling. This facilitates the learning of edge-preserving filters and leads to improved performance. (3) A warping-based network is used to estimate a motion field of moving objects without using semantic priors. This removes the requirement of scene rigidity and allows general videos to be used for unsupervised learning. The motion field is further regularized by an outlier-aware training loss. Although the depth model uses only a single image at test time and has only 2.97M parameters, it achieves state-of-the-art results on the KITTI and Cityscapes benchmarks.
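To illustrate how predicted depth, camera motion, and a per-pixel object motion field can together describe complete 3D motion, the following is a minimal single-pixel sketch. It assumes a pinhole camera model and a purely translational object motion vector; the function name `reproject_pixel` and all of its parameters are hypothetical and are not taken from the paper, whose actual formulation may differ.

```python
from typing import Sequence, Tuple


def reproject_pixel(
    u: float,
    v: float,
    depth: float,
    intrinsics: Tuple[float, float, float, float],
    rotation: Sequence[Sequence[float]],
    translation: Sequence[float],
    object_motion: Sequence[float] = (0.0, 0.0, 0.0),
) -> Tuple[float, float]:
    """Map a pixel in the source frame to its location in the target frame.

    Assumptions (illustrative only): pinhole intrinsics (fx, fy, cx, cy),
    a 3x3 camera rotation, a 3-vector camera translation, and a 3-vector
    object translation for this pixel (zero for static background).
    """
    fx, fy, cx, cy = intrinsics

    # Back-project the pixel to a 3D point using its predicted depth.
    point = ((u - cx) / fx * depth, (v - cy) / fy * depth, depth)

    # Apply the camera rotation.
    rx, ry, rz = (
        sum(rotation[i][j] * point[j] for j in range(3)) for i in range(3)
    )

    # Add the camera translation and the pixel's own object motion.
    px = rx + translation[0] + object_motion[0]
    py = ry + translation[1] + object_motion[1]
    pz = rz + translation[2] + object_motion[2]
    if pz <= 0.0:
        raise ValueError("point moved behind the camera")

    # Project the moved point back onto the image plane.
    return (fx * px / pz + cx, fy * py / pz + cy)


IDENTITY = ((1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0))
K = (100.0, 100.0, 100.0, 60.0)  # hypothetical fx, fy, cx, cy

# Static pixel: only the camera translation shifts it.
static_uv = reproject_pixel(150.0, 80.0, 10.0, K, IDENTITY, (0.5, 0.0, 0.0))

# Moving pixel: an object motion opposite to the camera translation cancels it.
moving_uv = reproject_pixel(
    150.0, 80.0, 10.0, K, IDENTITY, (0.5, 0.0, 0.0), (-0.5, 0.0, 0.0)
)
```

In a scene that is assumed rigid, the object motion term is zero everywhere, so any independently moving object violates the reprojection. Adding a per-pixel motion term is one way to account for such pixels, which is the role the abstract assigns to the estimated motion field.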