Self-supervised monocular depth estimation (MDE) has emerged as a promising approach because it does not require ground-truth depth. Despite continuous efforts, MDE remains sensitive to scale changes, especially when all training samples come from a single camera. The problem is aggravated by camera movement, which tightly couples the predicted depth with the scale change. In this paper, we present a scale-invariant approach to self-supervised MDE in which scale-sensitive features (SSFs) are detached while scale-invariant features (SIFs) are boosted. Specifically, we propose a simple but effective data augmentation that imitates the camera zooming process to detach SSFs, making the model robust to scale changes. In addition, we design a dynamic cross-attention module that boosts SIFs by adaptively fusing multi-scale cross-attention features. Extensive experiments on the KITTI dataset demonstrate that the detaching and boosting strategies are mutually complementary in MDE, and that our approach achieves new state-of-the-art performance, reducing the absolute relative error from 0.097 to 0.090 compared with existing works. The code will be made public soon.
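For readers unfamiliar with the two concrete ingredients named in the abstract, the following is a minimal sketch, not the authors' implementation. It assumes that camera zooming can be approximated by a center crop followed by nearest-neighbour resizing back to the original resolution, and that the absolute relative error follows the commonly used definition, the mean of |predicted - ground truth| / ground truth over pixels with valid ground truth. The function names `zoom_in` and `abs_rel` are hypothetical and chosen only for illustration; the abstract does not specify how the augmentation is actually implemented.

```python
import numpy as np


def zoom_in(image, factor):
    """Approximate a camera zoom-in (assumed form: center crop + resize).

    A centered window of size (H / factor, W / factor) is cropped and then
    resized back to (H, W) with nearest-neighbour sampling.
    """
    if factor < 1:
        raise ValueError("factor must be >= 1 for a zoom-in")
    h, w = image.shape[:2]
    ch = max(1, int(round(h / factor)))  # crop height
    cw = max(1, int(round(w / factor)))  # crop width
    top, left = (h - ch) // 2, (w - cw) // 2
    crop = image[top:top + ch, left:left + cw]
    # Nearest-neighbour index maps from output pixels to crop pixels.
    rows = (np.arange(h) * ch) // h
    cols = (np.arange(w) * cw) // w
    return crop[rows][:, cols]


def abs_rel(pred, gt):
    """Absolute relative error over pixels with valid (positive) ground truth."""
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)
    valid = gt > 0  # pixels without ground truth are ignored
    return float(np.mean(np.abs(pred[valid] - gt[valid]) / gt[valid]))


if __name__ == "__main__":
    image = np.arange(16).reshape(4, 4)
    print(zoom_in(image, 2))                   # 2x zoom on a toy 4x4 image
    print(abs_rel([1.1, 2.0], [1.0, 2.0]))     # roughly 0.05
```

Under this definition, the reported improvement from 0.097 to 0.090 means the mean per-pixel depth error dropped from about 9.7% to about 9.0% of the true depth.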