Accurate monocular depth estimation is a fundamental component of vision-based perception systems in intelligent transportation applications. Despite recent progress, unsupervised monocular approaches still suffer significant performance degradation in real-world traffic scenes due to synthetic-to-real domain gaps and the presence of dynamic, non-rigid objects such as vehicles and pedestrians. In this paper, we propose Back2Color, a robust unsupervised monocular depth estimation framework that addresses these challenges through domain adaptation and uncertainty-aware fusion. Specifically, Back2Color introduces a bidirectional depth-to-color transformation strategy that learns appearance mappings from real-world driving data and applies them to synthetic depth maps, thereby constructing training samples with realistic color appearance paired with synthetic ground-truth depth. In this way, the proposed approach effectively reduces the domain gap between simulated and real traffic scenes, enabling the depth prediction network to learn more stable and generalizable priors. To further improve robustness in dynamic environments, we propose an auto-learning uncertainty temporal-spatial fusion (Auto-UTSF) module, which adaptively fuses complementary temporal and spatial cues by estimating pixel-wise uncertainty, enabling reliable depth prediction in the presence of moving objects and occlusions. Extensive experiments on challenging urban driving benchmarks, including KITTI and Cityscapes, demonstrate that the proposed method consistently outperforms existing unsupervised monocular depth estimation approaches, particularly in dynamic traffic scenarios, while maintaining high computational efficiency.
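To make the fusion idea concrete: when two depth cues each come with a per-pixel uncertainty estimate, they can be combined by inverse-variance weighting, so that the more confident cue dominates at each pixel. The Auto-UTSF module in the paper learns these uncertainties end-to-end inside the network; the sketch below is only a generic NumPy illustration of uncertainty-weighted fusion, and all function and variable names are hypothetical, not taken from the paper's implementation.

```python
import numpy as np

def fuse_depth(depth_temporal, sigma_temporal, depth_spatial, sigma_spatial):
    """Per-pixel inverse-variance fusion of two depth estimates.

    depth_temporal / depth_spatial: HxW depth maps from the temporal
    (e.g. warped previous frame) and spatial (e.g. current frame) cues.
    sigma_*: HxW per-pixel uncertainty (standard deviation) maps.
    A pixel with lower sigma receives a higher fusion weight.
    """
    # Inverse-variance weights; epsilon guards against division by zero.
    w_t = 1.0 / (sigma_temporal ** 2 + 1e-8)
    w_s = 1.0 / (sigma_spatial ** 2 + 1e-8)
    # Convex per-pixel combination of the two cues.
    return (w_t * depth_temporal + w_s * depth_spatial) / (w_t + w_s)
```

With equal uncertainties this reduces to a plain average; as one cue's uncertainty shrinks (e.g. the temporal cue on static background, or the spatial cue on a moving vehicle where warping fails), the fused depth converges to that cue's prediction at those pixels.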