Semantic image synthesis aims to generate photorealistic images from a semantic segmentation map. Despite much recent progress, training such models still requires large datasets of images annotated with per-pixel label maps, which are extremely tedious to obtain. To alleviate the high annotation cost, we propose a transfer method that leverages a model trained on a large source dataset to improve learning on small target datasets via estimated pairwise relations between source and target classes. A class affinity matrix is introduced as the first layer of the source model to make it compatible with the target label maps, and the source model is then finetuned on the target domain. To estimate the class affinities, we consider different approaches to leveraging prior knowledge: semantic segmentation on the source domain, textual label embeddings, and self-supervised vision features. We apply our approach to GAN-based and diffusion-based architectures for semantic synthesis. Our experiments show that the different ways of estimating class affinity can be effectively combined, and that our approach significantly improves over existing state-of-the-art transfer approaches for generative image models.
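The role of the class affinity matrix can be illustrated with a small sketch. The abstract does not specify how the affinities are computed from textual label embeddings, so everything here is an assumption made for illustration: the toy embedding vectors, the class names, the use of cosine similarity followed by a temperature-scaled softmax, and the helper names `class_affinity` and `remap_label_map`. The sketch only shows how a target label map could be re-expressed as soft source-class inputs; the subsequent finetuning of the source model is not shown.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def class_affinity(target_emb, source_emb, temperature=0.1):
    """Return a (num_target x num_source) matrix whose rows sum to 1.

    Assumption: affinity is a softmax over cosine similarities between
    label embeddings. This is one plausible choice, not the paper's recipe.
    """
    matrix = []
    for t in target_emb:
        logits = [cosine(t, s) / temperature for s in source_emb]
        peak = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(x - peak) for x in logits]
        total = sum(exps)
        matrix.append([e / total for e in exps])
    return matrix


def remap_label_map(label_map, affinity):
    """Replace each target class index by its soft source-class vector.

    This plays the role of a 'first layer' that makes target label maps
    compatible with a model expecting source classes.
    """
    return [[affinity[c] for c in row] for row in label_map]


# Hypothetical embeddings: source classes (road, tree, sky),
# target classes (street, bush). Values are made up for the example.
source_emb = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
target_emb = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.1]]

affinity = class_affinity(target_emb, source_emb)

# A 2x2 target label map with class indices into the target classes.
label_map = [[0, 1], [1, 1]]
soft_map = remap_label_map(label_map, affinity)
```

Each pixel of `soft_map` holds a distribution over the three source classes, so a model trained on source label maps could consume it directly. In a real setup the matrix would more likely be a learnable layer initialised from such estimates, possibly combining the several affinity sources the abstract lists.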