Scene segmentation via unsupervised domain adaptation (UDA) enables the transfer of knowledge acquired from synthetic source data to real-world target data, which largely reduces the need for manual pixel-level annotations in the target domain. To facilitate domain-invariant feature learning, existing methods typically mix data from the source domain and the target domain by simply copying and pasting pixels. Such vanilla methods are usually sub-optimal since they do not take into account how well the mixed layouts correspond to real-world scenarios, which have an inherent layout. We observe that semantic categories, such as sidewalks, buildings, and sky, display relatively consistent depth distributions and can be clearly distinguished in a depth map. Based on this observation, we propose a depth-aware framework that explicitly leverages depth estimation to mix the categories and facilitates the two complementary tasks, i.e., segmentation and depth learning, in an end-to-end manner. In particular, the framework contains a Depth-guided Contextual Filter (DCF) for data augmentation and a cross-task encoder for contextual learning. DCF simulates real-world layouts, while the cross-task encoder adaptively fuses the complementary features of the two tasks. Besides, it is worth noting that several public datasets do not provide depth annotation. Therefore, we leverage an off-the-shelf depth estimation network to generate pseudo depth. Extensive experiments show that our proposed method, even with pseudo depth, achieves competitive performance on two widely-used benchmarks, i.e., 77.7 mIoU on GTA to Cityscapes and 69.3 mIoU on Synthia to Cityscapes.
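The core of the depth-guided mixing idea can be illustrated with a minimal sketch: instead of pasting source-class pixels onto the target image in an arbitrary order, the pasted categories are ordered by their (pseudo) depth so that nearer objects correctly occlude farther ones, better matching real-world layouts. The function name `depth_guided_mix`, the choice of the median as the per-class depth statistic, and the use of 255 as the ignore index are illustrative assumptions, not the paper's exact DCF implementation.

```python
import numpy as np

def depth_guided_mix(src_img, src_label, src_depth, tgt_img, classes):
    """Sketch of depth-aware copy-paste mixing (assumed simplification
    of DCF): paste the selected source classes onto the target image,
    ordered far-to-near by median depth so nearer categories occlude
    farther ones, mimicking real-world scene layouts."""
    mixed = tgt_img.copy()
    mixed_label = np.full_like(src_label, 255)  # 255 = ignore index (assumed)
    # Sort the chosen classes by their median source depth, farthest first,
    # so later (nearer) pastes overwrite earlier (farther) ones.
    order = sorted(
        classes,
        key=lambda c: np.median(src_depth[src_label == c]),
        reverse=True,
    )
    for c in order:
        mask = src_label == c
        mixed[mask] = src_img[mask]
        mixed_label[mask] = c
    return mixed, mixed_label
```

In a full UDA pipeline the unpasted regions would keep the target image's pseudo-labels rather than the ignore index; the sketch only shows the depth-ordered compositing step.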