Self-supervised monocular depth estimation is a salient task for 3D scene understanding. Several methods learn depth jointly with monocular ego-motion estimation and predict accurate pixel-wise depth without using labeled data. Nevertheless, these methods focus on improving performance under ideal conditions, free of natural or digital corruptions. They also generally assume the absence of occlusions, even for object-specific depth estimation. Moreover, they are vulnerable to adversarial attacks, a pertinent concern for their reliable deployment in robots and autonomous driving systems. We propose MIMDepth, a method that adapts masked image modeling (MIM) for self-supervised monocular depth estimation. While MIM has been used to learn generalizable features during pre-training, we show how it can be adapted for direct training of monocular depth estimation. Our experiments show that MIMDepth is more robust to noise, blur, weather conditions, digital artifacts, and occlusions, as well as to untargeted and targeted adversarial attacks.
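To make the masking idea concrete, the following is a minimal sketch of the generic input-masking step used in masked image modeling: a random subset of non-overlapping image patches is hidden before the image reaches the network. It is not the MIMDepth implementation. The function name `random_patch_mask`, the patch size of 16, the mask ratio of 0.5, the zero fill value, and the 192x640 input size are all illustrative assumptions that do not come from the abstract.

```python
import numpy as np


def random_patch_mask(image, patch_size=16, mask_ratio=0.5, seed=0):
    """Zero out a random subset of non-overlapping patches (hypothetical helper).

    image: array of shape (H, W, C); H and W must be divisible by patch_size.
    Returns the masked image and a boolean patch grid (True = masked patch).
    """
    h, w = image.shape[:2]
    if h % patch_size or w % patch_size:
        raise ValueError("image height and width must be divisible by patch_size")

    grid_h, grid_w = h // patch_size, w // patch_size
    num_patches = grid_h * grid_w
    num_masked = int(round(mask_ratio * num_patches))

    # Choose which patches to hide, without replacement.
    rng = np.random.default_rng(seed)
    flat = np.zeros(num_patches, dtype=bool)
    flat[rng.choice(num_patches, size=num_masked, replace=False)] = True
    grid = flat.reshape(grid_h, grid_w)

    # Expand the patch-level grid to a pixel-level mask.
    pixel_mask = np.repeat(np.repeat(grid, patch_size, axis=0), patch_size, axis=1)

    masked = image.copy()
    masked[pixel_mask] = 0  # assumed fill value; a learned mask token is another option
    return masked, grid


# Example: mask half of the 16x16 patches of a synthetic 192x640 RGB image.
image = np.random.default_rng(1).random((192, 640, 3)).astype(np.float32)
masked, grid = random_patch_mask(image, patch_size=16, mask_ratio=0.5)
print(grid.shape, int(grid.sum()))
```

In a self-supervised depth pipeline, a masked image of this kind would be fed to the network in place of the original input; how the masking is applied to the depth and ego-motion networks, and at what ratio, is left unspecified here because the abstract does not state it.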