Virtual engines can generate dense depth maps for a wide variety of synthetic scenes, making them invaluable for training depth estimation models. However, discrepancies between synthetic and real-world colors pose significant challenges for depth estimation in real-world scenes, especially in the complex and uncertain environments encountered in unsupervised monocular depth estimation tasks. To address this issue, we propose Back2Color, a framework that predicts realistic colors from depth using a model trained on real-world data, thereby transforming synthetic colors into their real-world counterparts. In addition, we introduce the Syn-Real CutMix method for joint training with real-world unsupervised and synthetic supervised depth samples, enhancing monocular depth estimation performance in real-world scenes. Furthermore, to mitigate the impact of non-rigid motions on depth estimation, we present an auto-learning uncertainty temporal-spatial fusion method (Auto-UTSF), which combines the strengths of unsupervised learning in both the temporal and spatial dimensions. We also design VADepth, based on the Vision Attention Network, which offers lower computational complexity and higher accuracy than Transformer-based alternatives. Our Back2Color framework achieves state-of-the-art performance on the KITTI dataset, with improved metrics and finer-grained detail in the predicted depth, and the gains are even more pronounced on more challenging datasets such as Cityscapes for unsupervised depth estimation.
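The abstract does not spell out how Syn-Real CutMix composes the two kinds of samples; the following is a minimal sketch of the general CutMix-style idea, assuming a rectangular patch of a synthetic image (with dense ground-truth depth) is pasted into a real-world image, with a mask marking where supervised depth loss applies. All function and parameter names here are illustrative, not the authors' implementation.

```python
import torch

def syn_real_cutmix(real_img, syn_img, syn_depth, patch_ratio=0.5):
    """CutMix-style mixing of a real (unsupervised) and a synthetic
    (supervised) sample. Returns the mixed image, the depth target that is
    valid only inside the pasted patch, and the corresponding mask.

    real_img, syn_img: (B, 3, H, W) tensors; syn_depth: (B, 1, H, W).
    """
    _, _, h, w = real_img.shape
    ph, pw = int(h * patch_ratio), int(w * patch_ratio)
    top = torch.randint(0, h - ph + 1, (1,)).item()
    left = torch.randint(0, w - pw + 1, (1,)).item()

    # Paste the synthetic patch into the real image.
    mixed_img = real_img.clone()
    mixed_img[:, :, top:top + ph, left:left + pw] = \
        syn_img[:, :, top:top + ph, left:left + pw]

    # Mask is 1 where dense synthetic ground truth is available; outside the
    # patch, training would fall back to the unsupervised photometric loss.
    supervised_mask = torch.zeros_like(syn_depth)
    supervised_mask[:, :, top:top + ph, left:left + pw] = 1.0
    depth_target = syn_depth * supervised_mask

    return mixed_img, depth_target, supervised_mask
```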