SegDA: Maximum Separable Segment Mask with Pseudo Labels for Domain Adaptive Semantic Segmentation

Unsupervised Domain Adaptation (UDA) addresses label scarcity in the target domain by transferring knowledge from a label-rich source domain. The source domain usually consists of synthetic images, whose annotations are easily obtained with well-known computer graphics techniques. Annotating real-world images (the target domain), however, requires substantial manual effort and is very time-consuming because every pixel must be labeled. To address this problem, we propose the SegDA module, which improves the transfer performance of UDA methods by learning a maximally separable segment representation. This helps resolve the confusion between visually similar classes such as pedestrian/rider and sidewalk/road. Inspired by Neural Collapse, we use an Equiangular Tight Frame (ETF) classifier to maximize the separation between segment classes. This causes the source-domain pixel representations of each class to collapse to a single vector, and these vectors form the vertices of a simplex aligned with the maximally separable ETF classifier. We exploit this phenomenon to propose a novel architecture for adapting the segment representation to the target domain. Additionally, we propose to estimate the noise in the pseudo labels of target-domain images and to update the decoder for noise correction, which encourages the discovery of pixels belonging to classes missing from the pseudo labels. We evaluate on four UDA benchmarks simulating synthetic-to-real, daytime-to-nighttime, and clear-to-adverse-weather scenarios. Our approach yields gains of +2.2 mIoU on GTA → Cityscapes, +2.0 mIoU on Synthia → Cityscapes, +5.9 mIoU on Cityscapes → DarkZurich, and +2.6 mIoU on Cityscapes → ACDC.
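To make the ETF idea concrete, the following is a minimal, illustrative sketch of a fixed simplex ETF classifier. It is not the SegDA implementation. For K classes and feature dimension d ≥ K, the class vectors are the columns of M = sqrt(K/(K-1)) · U (I_K - (1/K) 1 1ᵀ), where U is a d×K matrix with orthonormal columns. Every class vector then has unit norm, and every pair has cosine -1/(K-1), the largest equal pairwise separation achievable for K vectors. The function names, the Gram-Schmidt construction of U, and the cosine-similarity rule for labeling pixel features are assumptions made for illustration.

```python
import math
import random


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def orthonormal_vectors(d, k, seed=0):
    """Return k orthonormal vectors of length d (the columns of U, so U^T U = I_k).

    Hypothetical helper: random Gaussian vectors orthogonalized by Gram-Schmidt.
    """
    if k > d:
        raise ValueError("feature dimension d must be >= number of classes k")
    rng = random.Random(seed)
    basis = []
    while len(basis) < k:
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        # Remove the components along every previously accepted basis vector.
        for b in basis:
            proj = dot(v, b)
            v = [vi - proj * bi for vi, bi in zip(v, b)]
        norm = math.sqrt(dot(v, v))
        if norm > 1e-8:  # skip (numerically) dependent draws
            basis.append([vi / norm for vi in v])
    return basis


def simplex_etf(d, k, seed=0):
    """Class vectors of M = sqrt(k/(k-1)) * U (I_k - (1/k) 1 1^T), one list per class."""
    u = orthonormal_vectors(d, k, seed)
    scale = math.sqrt(k / (k - 1))
    # Column j of U (I - 11^T / k) equals u_j minus the mean of all u_i.
    mean = [sum(col[r] for col in u) / k for r in range(d)]
    return [[scale * (col[r] - mean[r]) for r in range(d)] for col in u]


def classify_pixels(features, etf):
    """Label each pixel feature with the ETF vertex of highest cosine similarity.

    The ETF vectors are fixed (not learned) and have unit norm, so the cosine
    reduces to dot(f, w) / ||f||.
    """
    labels = []
    for f in features:
        f_norm = math.sqrt(dot(f, f)) or 1.0  # guard against an all-zero feature
        scores = [dot(f, w) / f_norm for w in etf]
        labels.append(max(range(len(etf)), key=scores.__getitem__))
    return labels


if __name__ == "__main__":
    K, D = 19, 64  # 19 Cityscapes evaluation classes; D is a hypothetical feature size
    etf = simplex_etf(D, K)
    print("norm of class 0:", dot(etf[0], etf[0]))           # should be ~1
    print("cosine(class 0, class 1):", dot(etf[0], etf[1]))  # should be ~ -1/(K-1)
    print(classify_pixels([etf[3], [2.5 * x for x in etf[7]]], etf))
```

Because the classifier weights are fixed at the ETF vertices, training only has to pull each class's pixel features toward its assigned vertex, which is the collapse behavior described above; the classifier itself never drifts toward a less separable configuration.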

翻译：无监督域自适应（UDA）旨在通过从标签丰富的源域迁移知识，解决目标域标签稀缺的问题。通常，源域由利用成熟的计算机图形技术易于获取标注的合成图像构成。然而，真实世界图像（目标域）的标注需要大量人工标注工作，且因逐像素标注需求而极为耗时。为解决此问题，我们提出SegDA模块，通过学习最大可分离分割表示来提升UDA方法的迁移性能，从而解决行人/骑手、人行道/道路等视觉相似类别的识别难题。受神经坍缩理论启发，我们采用等角紧框架（ETF）分类器实现分割类别间的最大分离。该方法使源域像素表示坍缩为单一向量，形成与最大可分离ETF分类器对齐的单纯形顶点。我们利用这一现象提出新颖架构，用于目标域分割表示的域自适应。此外，我们提出估计目标域图像标注噪声的方法，并更新解码器进行噪声校正，从而促进伪标签中未识别类别像素的发现。我们在四种UDA基准场景上进行了验证，涵盖合成到真实、白天到夜间、晴好到恶劣天气等情境。所提方法在GTA→Cityscapes上提升+2.2 mIoU，Synthia→Cityscapes提升+2.0 mIoU，Cityscapes→DarkZurich提升+5.9 mIoU，Cityscapes→ACDC提升+2.6 mIoU。