This paper focuses on self-supervised monocular depth estimation in dynamic scenes, trained on monocular videos. Existing methods jointly estimate pixel-wise depth and motion, relying mainly on an image reconstruction loss. Dynamic regions remain a critical challenge for these methods: the inherent ambiguity between depth and motion leads to inaccurate depth estimates. This paper proposes a self-supervised training framework that exploits pseudo depth labels for dynamic regions extracted from the training data. The key contribution of our framework is to decouple depth estimation for the static and dynamic regions of the training images. We start with an unsupervised depth estimation approach, which provides reliable depth estimates for static regions and motion cues for dynamic regions, and allows us to extract moving-object information at the instance level. In the next stage, we use an object network to estimate the depth of these moving objects under a rigid-motion assumption. We then propose a new scale alignment module to resolve the scale ambiguity between the depths estimated for static and dynamic regions. The generated depth labels are then used to train an end-to-end depth estimation network and improve its performance. Extensive experiments on the Cityscapes and KITTI datasets show that our self-training strategy consistently outperforms existing self-/unsupervised depth estimation methods.
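The abstract does not specify how the scale alignment module works; as a rough, hypothetical illustration of the general idea of aligning two per-region depth predictions that are each only defined up to scale, the sketch below rescales an object branch's depth by a median ratio computed over a chosen reference region. All names (`median_scale_align`, `object_depth`, `reference_depth`, `reference_mask`) are illustrative and are not the paper's actual formulation.

```python
import numpy as np

def median_scale_align(object_depth, reference_depth, reference_mask, eps=1e-8):
    """Illustrative median-ratio scale alignment (assumption, not the paper's module).

    Rescales the object branch's depth map so that, over a reference region
    (e.g. static pixels around the object's ground contact), its median depth
    agrees with the static branch's depth.
    """
    ref = reference_depth[reference_mask]   # static-branch depth in the reference region
    obj = object_depth[reference_mask]      # object-branch depth in the same region
    scale = np.median(ref) / (np.median(obj) + eps)
    return object_depth * scale

# Hypothetical usage: paste the rescaled object depth into the pseudo depth label.
# pseudo_depth = static_depth.copy()
# pseudo_depth[object_mask] = median_scale_align(object_depth, static_depth, ring_mask)[object_mask]
```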