Self-supervised monocular depth estimation has been widely studied recently. Most of the work has focused on improving performance on benchmark datasets, such as KITTI, but has offered few experiments on generalization performance. In this paper, we investigate how the choice of backbone network (e.g., CNNs, Transformers, and CNN-Transformer hybrid models) affects the generalization of monocular depth estimation. We first evaluate state-of-the-art models on diverse public datasets that were never seen during network training. Next, we investigate the effects of texture-biased and shape-biased representations using various texture-shifted datasets that we generated. We observe that Transformers exhibit a strong shape bias, whereas CNNs exhibit a strong texture bias. We also find that shape-biased models generalize better for monocular depth estimation than texture-biased models. Based on these observations, we design a new CNN-Transformer hybrid network with a multi-level adaptive feature fusion module, called MonoFormer. The design intuition behind MonoFormer is to increase shape bias by employing Transformers while compensating for the weak locality bias of Transformers by adaptively fusing multi-level representations. Extensive experiments show that the proposed method achieves state-of-the-art performance on various public datasets. Our method also shows the best generalization ability among competing methods.