Although originally designed for image generation, diffusion models have recently been shown to provide excellent pretrained feature representations for semantic segmentation. Intrigued by this result, we explore how well diffusion-pretrained representations generalize to new domains, a crucial ability for any representation. We find that diffusion pretraining achieves extraordinary domain generalization results for semantic segmentation, outperforming both supervised and self-supervised backbone networks. Motivated by this, we investigate how to use the model's unique ability to take an input prompt to further enhance its cross-domain performance. We introduce a scene prompt and a prompt randomization strategy that help further disentangle the domain-invariant information when training the segmentation head. Moreover, we propose a simple but highly effective approach for test-time domain adaptation, based on learning a scene prompt on the target domain in an unsupervised manner. Extensive experiments on four synthetic-to-real and clear-to-adverse weather benchmarks demonstrate the effectiveness of our approaches. Without resorting to any complex techniques, such as image translation, augmentation, or rare-class sampling, we set a new state of the art on all benchmarks. Our implementation will be publicly available at \url{https://github.com/ETHRuiGong/PTDiffSeg}.