Monocular depth estimation is challenging because a single RGB image provides only incomplete cues to exact depth. To overcome this limitation, deep neural networks rely on visual hints such as size, shade, and texture extracted from the RGB information. However, we observe that when such hints are over-exploited, the network can become biased toward RGB information and fail to consider the comprehensive view. We propose a novel depth estimation model named RElative Depth Transformer (RED-T) that uses relative depth as guidance in self-attention. Specifically, the model assigns high attention weights to pixels of close depth and low attention weights to pixels of distant depth. As a result, features at similar depths become more similar to each other and thus less prone to misused visual hints. We show that the proposed model achieves competitive results on monocular depth estimation benchmarks and is less biased toward RGB information. In addition, we propose a novel monocular depth estimation benchmark that limits the observable depth range during training in order to evaluate the robustness of the model to unseen depths.
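The depth-guided attention described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formulation used in RED-T, so the form below is an assumption: a penalty proportional to the absolute relative-depth gap is subtracted from the usual scaled dot-product logits before the softmax. The function name `depth_guided_attention`, the `penalty` parameter, and the use of the input features directly as queries, keys, and values are all hypothetical choices made for illustration.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def depth_guided_attention(features, depths, penalty=1.0):
    """Self-attention over N pixels with a relative-depth bias.

    features: list of N feature vectors (used as queries, keys, and values;
              an assumption made to keep the sketch short).
    depths:   list of N relative depth values, one per pixel.
    penalty:  hypothetical strength of the depth bias; 0 disables it.
    Returns (outputs, weights).
    """
    n = len(features)
    dim = len(features[0])
    scale = 1.0 / math.sqrt(dim)

    weights = []
    for i in range(n):
        logits = []
        for j in range(n):
            similarity = scale * sum(a * b for a, b in zip(features[i], features[j]))
            depth_gap = abs(depths[i] - depths[j])
            # Assumed bias: a larger depth gap lowers the attention logit.
            logits.append(similarity - penalty * depth_gap)
        weights.append(softmax(logits))

    # Each output is the attention-weighted average of all feature vectors.
    outputs = [
        [sum(weights[i][j] * features[j][c] for j in range(n)) for c in range(dim)]
        for i in range(n)
    ]
    return outputs, weights
```

With identical features, the attention weights in this sketch depend only on the depth gaps, so a pixel attends more to a neighbor at similar depth than to one far away in depth. Setting `penalty=0.0` recovers plain scaled dot-product attention.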