Unsupervised domain adaptation (UDA) and domain generalization (DG) enable machine learning models trained on a source domain to perform well on unlabeled or even unseen target domains. As previous UDA&DG semantic segmentation methods are mostly based on outdated networks, we benchmark more recent architectures, reveal the potential of Transformers, and design the DAFormer network tailored for UDA&DG. DAFormer is enabled by three training strategies that avoid overfitting to the source domain: (1) Rare Class Sampling mitigates the bias toward common source domain classes, while (2) a Thing-Class ImageNet Feature Distance and (3) a learning rate warmup promote feature transfer from ImageNet pretraining. As UDA&DG are usually GPU memory intensive, most previous methods downscale or crop images. However, low-resolution predictions often fail to preserve fine details, while models trained with cropped images fall short in capturing long-range, domain-robust context information. Therefore, we propose HRDA, a multi-resolution framework for UDA&DG that combines the strengths of small high-resolution crops, which preserve fine segmentation details, and large low-resolution crops, which capture long-range context dependencies, using a learned scale attention. DAFormer and HRDA significantly improve the state of the art in UDA&DG by more than 10 mIoU on 5 different benchmarks. The implementation is available at https://github.com/lhoyer/HRDA.
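To make the idea behind Rare Class Sampling concrete, the following is a minimal sketch and not the released implementation. It assumes that source images are drawn by first picking a class with a probability that grows as the class becomes rarer, here modeled as a temperature-scaled softmax over one minus the class pixel frequency. The function names, the `temperature` parameter, and the toy class counts are illustrative assumptions.

```python
import math
import random


def rare_class_probabilities(class_pixel_counts, temperature=0.5):
    """Map per-class pixel counts to sampling probabilities.

    Assumption for illustration: P(c) is a softmax over (1 - f_c) / T,
    where f_c is the pixel frequency of class c in the source data.
    A smaller temperature concentrates sampling on the rarest classes.
    """
    total = sum(class_pixel_counts.values())
    logits = {
        c: (1.0 - n / total) / temperature
        for c, n in class_pixel_counts.items()
    }
    # Subtract the maximum logit for numerical stability.
    peak = max(logits.values())
    exps = {c: math.exp(v - peak) for c, v in logits.items()}
    norm = sum(exps.values())
    return {c: e / norm for c, e in exps.items()}


def sample_source_image(images_by_class, probs, rng):
    """Pick a class according to probs, then a random image containing it."""
    classes = list(probs)
    weights = [probs[c] for c in classes]
    chosen = rng.choices(classes, weights=weights, k=1)[0]
    return chosen, rng.choice(images_by_class[chosen])


# Toy source statistics (hypothetical numbers, not from any dataset).
counts = {"road": 9000, "car": 900, "bicycle": 100}
probs = rare_class_probabilities(counts)

# Hypothetical index of which source images contain which class.
images_by_class = {
    "road": ["img_0", "img_1", "img_2"],
    "car": ["img_1", "img_2"],
    "bicycle": ["img_2"],
}
picked_class, picked_image = sample_source_image(
    images_by_class, probs, random.Random(0)
)
```

With these toy counts, `bicycle` receives the highest sampling probability and `road` the lowest, so images containing rare classes are seen more often than under uniform sampling.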
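The fusion of a large low-resolution context prediction with a small high-resolution detail prediction can be sketched as follows. This is a simplified illustration under stated assumptions, not the HRDA implementation: it uses a single-channel score map stored as nested lists, assumes the context prediction has already been upsampled to the output grid, and assumes the scale attention is a per-pixel weight in [0, 1] that applies only inside the detail crop. The function name and all parameters are hypothetical.

```python
def fuse_multi_resolution(context_scores, detail_scores, attention,
                          crop_top, crop_left):
    """Blend context and detail predictions with a per-pixel attention.

    context_scores: H x W scores from the low-resolution context crop,
                    assumed to be upsampled to the output grid already.
    detail_scores:  h x w scores from the high-resolution detail crop.
    attention:      H x W weights in [0, 1]; higher values trust the
                    detail prediction more.
    crop_top, crop_left: position of the detail crop in the output grid.
    """
    height, width = len(context_scores), len(context_scores[0])
    crop_h, crop_w = len(detail_scores), len(detail_scores[0])
    fused = []
    for y in range(height):
        row = []
        for x in range(width):
            inside = (crop_top <= y < crop_top + crop_h
                      and crop_left <= x < crop_left + crop_w)
            if inside:
                a = attention[y][x]
                d = detail_scores[y - crop_top][x - crop_left]
                row.append((1.0 - a) * context_scores[y][x] + a * d)
            else:
                # No detail prediction here, so fall back to context only.
                row.append(context_scores[y][x])
        fused.append(row)
    return fused


# Toy inputs: a 4x4 context map and a 2x2 detail crop placed at (1, 1).
context = [[0.5] * 4 for _ in range(4)]
detail = [[1.0] * 2 for _ in range(2)]
attn = [[0.75] * 4 for _ in range(4)]
fused = fuse_multi_resolution(context, detail, attn, crop_top=1, crop_left=1)
```

In a real model the attention would be predicted by the network rather than fixed, which lets it favor the detail crop for small objects and the context crop for large, ambiguous regions.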