Monocular depth estimation from RGB images plays a pivotal role in 3D vision. However, its accuracy can deteriorate in challenging environments such as nighttime or adverse weather. Long-wave infrared cameras image stably under such conditions, but they are inherently low-resolution and lack the rich texture and semantics that RGB images provide. Current methods focus solely on a single modality because identifying and integrating reliable depth cues from both sources is difficult. To address these issues, this paper presents a learning-based framework that identifies and integrates the dominant depth features across the two modalities. Concretely, we first compute a coarse depth map for each modality with a separate network, so that each network fully exploits the depth cues of its own modality. Because the more reliable depth estimates are spread across both modalities, we propose a novel confidence loss that steers a confidence predictor network to yield a confidence map specifying the regions of potentially reliable depth. Given this confidence map, we propose a multi-modal fusion network that produces the final depth in an end-to-end manner. With the proposed pipeline, our method estimates depth robustly in a variety of difficult scenarios. Experimental results on the challenging MS$^2$ and ViViD++ datasets demonstrate the effectiveness and robustness of our method.
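The abstract does not give the architecture of the fusion network or the form of the confidence loss, so the following is only a rough sketch of the general idea of confidence-guided fusion, not the method itself. It assumes two coarse depth maps of the same size and a per-pixel confidence map in [0, 1], and it replaces the learned fusion network with a plain convex combination. The function name `fuse_depth`, the convention that a confidence of 1 favors the RGB branch, and the toy values are all hypothetical.

```python
import numpy as np


def fuse_depth(depth_rgb, depth_thermal, confidence):
    """Blend two coarse depth maps per pixel.

    confidence lies in [0, 1]: 1 means "trust the RGB depth", 0 means
    "trust the thermal depth". This convex combination is a hypothetical
    stand-in for the learned fusion network described in the abstract.
    """
    depth_rgb = np.asarray(depth_rgb, dtype=np.float64)
    depth_thermal = np.asarray(depth_thermal, dtype=np.float64)
    # Clamp so that out-of-range predictions cannot extrapolate beyond
    # the two input depths.
    confidence = np.clip(np.asarray(confidence, dtype=np.float64), 0.0, 1.0)
    if not (depth_rgb.shape == depth_thermal.shape == confidence.shape):
        raise ValueError("all three maps must share the same shape")
    return confidence * depth_rgb + (1.0 - confidence) * depth_thermal


# Toy 2x2 example (depths in metres; values are made up for illustration).
depth_rgb = np.array([[10.0, 20.0], [30.0, 40.0]])
depth_thermal = np.array([[12.0, 18.0], [30.0, 50.0]])
confidence = np.array([[1.0, 0.0], [0.5, 0.25]])

fused = fuse_depth(depth_rgb, depth_thermal, confidence)
```

In this simplified view, a pixel with confidence near 1 keeps the RGB estimate and a pixel with confidence near 0 keeps the thermal estimate. A learned fusion network can presumably go beyond such a per-pixel blend, for example by using spatial context around each pixel.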