Domain generalization (DG) remains a significant challenge for perception based on deep neural networks (DNN), where domain shifts occur due to lighting, weather, or geolocation changes. In this work, we propose VLTSeg to enhance domain generalization in semantic segmentation, where the network is solely trained on the source domain and evaluated on unseen target domains. Our method leverages the inherent semantic robustness of vision-language models. First, by substituting traditional vision-only backbones with pre-trained encoders from CLIP and EVA-CLIP as transfer learning setting we find that in the field of DG, vision-language pre-training significantly outperforms supervised and self-supervised vision pre-training. We thus propose a new vision-language approach for domain generalized segmentation, which improves the domain generalization SOTA by 7.6% mIoU when training on the synthetic GTA5 dataset. We further show the superior generalization capabilities of vision-language segmentation models by reaching 76.48% mIoU on the popular Cityscapes-to-ACDC benchmark, outperforming the previous SOTA approach by 6.9% mIoU on the test set at the time of writing. Additionally, our approach shows strong in-domain generalization capabilities indicated by 86.1% mIoU on the Cityscapes test set, resulting in a shared first place with the previous SOTA on the current leaderboard at the time of submission.
翻译:域泛化(DG)仍是基于深度神经网络(DNN)感知领域面临的重大挑战,光照、天气或地理位置变化会导致域偏移。本文提出VLTSeg方法,旨在增强语义分割中的域泛化能力,该网络仅在源域上训练,并在未见过的目标域上进行评估。我们的方法利用了视觉-语言模型固有的语义鲁棒性。首先,通过将传统纯视觉骨干网络替换为来自CLIP和EVA-CLIP的预训练编码器作为迁移学习设置,我们发现:在域泛化领域,视觉-语言预训练显著优于有监督和自监督视觉预训练。因此,我们提出一种面向域泛化分割的新型视觉-语言方法,在合成GTA5数据集上训练时,该方法将域泛化最先进水平(SOTA)提升了7.6%的mIoU。我们进一步展示了视觉-语言分割模型的卓越泛化能力:在经典的Cityscapes-to-ACDC基准测试中达到76.48%的mIoU,较当时测试集上的先前SOTA方法提升了6.9%的mIoU。此外,我们的方法展现出强大的域内泛化能力,在Cityscapes测试集上达到86.1%的mIoU,在提交时与先前SOTA共同占据当前排行榜首位。