Large, pretrained latent diffusion models (LDMs) have demonstrated an extraordinary ability to generate creative content, specialize to user data through few-shot fine-tuning, and condition their output on other modalities, such as semantic maps. However, are they usable as large-scale data generators, e.g., to improve tasks in the perception stack, like semantic segmentation? We investigate this question in the context of autonomous driving, and answer it with a resounding "yes". We propose an efficient data generation pipeline termed DGInStyle. First, we examine the problem of specializing a pretrained LDM to semantically controlled generation within a narrow domain. Second, we propose a Style Swap technique to endow the rich generative prior with the learned semantic control. Third, we design a Multi-resolution Latent Fusion technique to overcome the bias of LDMs towards dominant objects. Using DGInStyle, we generate a diverse dataset of street scenes, train a domain-agnostic semantic segmentation model on it, and evaluate the model on multiple popular autonomous driving datasets. Our approach consistently improves the performance of several domain generalization methods over the previous state of the art. The source code and the generated dataset are available at https://dginstyle.github.io.