Monocular depth estimation remains an open challenge in computer vision. Recent Transformer models have shown notable advantages over conventional CNNs on this task. However, it is still poorly understood how these models prioritize different regions of a 2D image and how those regions affect depth estimation performance. To explore the differences between Transformers and CNNs, we use a sparse pixel approach to compare the two contrastively. Our findings suggest that while Transformers excel at handling global context and intricate textures, they lag behind CNNs in preserving depth gradient continuity. To improve Transformer performance in monocular depth estimation, we propose the Depth Gradient Refinement (DGR) module, which refines depth estimates through high-order differentiation, feature fusion, and recalibration. Additionally, we draw on optimal transport theory: treating depth maps as spatial probability distributions, we use the optimal transport distance as a loss function to optimize our model. Experimental results demonstrate that models equipped with the plug-and-play DGR module and the proposed loss function achieve better performance without added complexity or computational cost. This research not only offers fresh insights into the distinctions between Transformers and CNNs in depth estimation but also paves the way for novel depth estimation methodologies.
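To make the optimal transport idea concrete, the following is a minimal sketch of comparing two depth maps as spatial probability distributions. It is not the loss used in this work: the abstract does not specify the normalization, the ground cost, or the solver, so the sketch assumes sum-to-one normalization, a squared Euclidean cost between pixel coordinates, and entropy-regularized Sinkhorn iterations as a stand-in solver. The names `to_distribution`, `pixel_cost`, `sinkhorn_cost`, `eps`, and `n_iters` are hypothetical.

```python
import math


def to_distribution(depth):
    """Flatten a 2D depth map (list of rows) into a probability vector.

    Assumption: the map is normalized by its sum; the source does not say how.
    """
    flat = [float(v) for row in depth for v in row]
    total = sum(flat)
    if total <= 0 or min(flat) < 0:
        raise ValueError("depth values must be non-negative with a positive sum")
    return [v / total for v in flat]


def pixel_cost(height, width):
    """Squared Euclidean distance between every pair of pixel coordinates."""
    coords = [(r, c) for r in range(height) for c in range(width)]
    return [[(r1 - r2) ** 2 + (c1 - c2) ** 2 for (r2, c2) in coords]
            for (r1, c1) in coords]


def sinkhorn_cost(pred, target, eps=1.0, n_iters=200):
    """Transport cost of the entropy-regularized plan between two depth maps.

    eps is the regularization strength; larger values give a smoother plan
    and a looser approximation of the unregularized transport distance.
    """
    height, width = len(pred), len(pred[0])
    p, q = to_distribution(pred), to_distribution(target)
    if len(p) != len(q):
        raise ValueError("depth maps must have the same shape")
    n = height * width
    cost = pixel_cost(height, width)
    # Gibbs kernel derived from the ground cost.
    kernel = [[math.exp(-cost[i][j] / eps) for j in range(n)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * n
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the two marginals.
        u = [p[i] / sum(kernel[i][j] * v[j] for j in range(n)) for i in range(n)]
        v = [q[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(n)]
    # Total cost of the plan P[i][j] = u[i] * kernel[i][j] * v[j].
    return sum(u[i] * kernel[i][j] * v[j] * cost[i][j]
               for i in range(n) for j in range(n))


if __name__ == "__main__":
    near = [[10, 1], [1, 1]]  # most mass in the top-left pixel
    far = [[1, 1], [1, 10]]   # most mass in the bottom-right pixel
    print("same map:   ", sinkhorn_cost(near, near))
    print("shifted map:", sinkhorn_cost(near, far))
```

Unlike a per-pixel error, this cost grows with how far depth mass has to move across the image, which is the property that makes a transport distance sensitive to spatial structure. The pure-Python version builds a dense cost matrix over all pixel pairs, so it is only practical for very small maps; a training loss would need a differentiable tensor implementation.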