Learned representation-guided diffusion models for large-image generation

To synthesize high-fidelity samples, diffusion models typically require auxiliary data to guide the generation process. However, in specialized domains such as histopathology and satellite imagery, the patch-level annotation this would require is impractical to obtain: it is usually performed by domain experts and involves hundreds of millions of patches. Modern self-supervised learning (SSL) representations encode rich semantic and visual information. In this paper, we posit that such representations are expressive enough to act as proxies for fine-grained human labels. We introduce a novel approach that trains diffusion models conditioned on SSL embeddings. Our diffusion models successfully project these features back to high-quality histopathology and remote sensing images. In addition, we construct larger images by assembling spatially consistent patches inferred from SSL embeddings, preserving long-range dependencies. Augmenting real data with generated variations of real images improves downstream classifier accuracy on both patch-level and larger, image-scale classification tasks. Our models remain effective on datasets not encountered during training, demonstrating their robustness and generalizability. Because generation from learned embeddings is agnostic to the source of those embeddings, the SSL embeddings used to generate a large image can either be extracted from a reference image or sampled from an auxiliary model conditioned on any related modality (e.g., class labels, text, genomic data). As a proof of concept, we introduce the text-to-large-image synthesis paradigm, in which we synthesize large pathology and satellite images from text descriptions.
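
The large-image construction described above can be illustrated with a toy sketch in Python. It assumes a 2D grid of SSL embeddings, one per patch location, and tiles the patch generated for each embedding into a single canvas. The function `sample_patch` is a hypothetical stand-in for a conditional diffusion sampler; it only emits noise centred on the embedding mean. All names and parameters here are assumptions made for illustration, not the method of the paper, and the sketch generates each patch independently, so it omits whatever mechanism keeps neighbouring patches spatially consistent.

```python
import random
from typing import List

Embedding = List[float]
Image = List[List[float]]  # rows of grayscale pixel values


def sample_patch(embedding: Embedding, patch_size: int, seed: int) -> Image:
    """Hypothetical stand-in for an embedding-conditioned diffusion sampler.

    A real sampler would iteratively denoise Gaussian noise while
    conditioning on `embedding`. Here we only emit low-variance noise
    centred on the embedding mean so that the example runs anywhere.
    """
    rng = random.Random(seed)
    mean = sum(embedding) / len(embedding)
    return [
        [mean + rng.gauss(0.0, 0.01) for _ in range(patch_size)]
        for _ in range(patch_size)
    ]


def assemble_large_image(grid: List[List[Embedding]], patch_size: int) -> Image:
    """Tile one generated patch per grid cell into a single large image."""
    rows, cols = len(grid), len(grid[0])
    image = [[0.0] * (cols * patch_size) for _ in range(rows * patch_size)]
    for r in range(rows):
        for c in range(cols):
            # One deterministic seed per cell keeps the sketch reproducible.
            patch = sample_patch(grid[r][c], patch_size, seed=r * cols + c)
            x0 = c * patch_size
            for y in range(patch_size):
                # Copy row y of the patch into its slot on the canvas.
                image[r * patch_size + y][x0:x0 + patch_size] = patch[y]
    return image


if __name__ == "__main__":
    # A 2x2 grid of 2-dimensional toy embeddings (assumed values).
    toy_grid = [[[0.0, 0.0], [1.0, 1.0]], [[2.0, 2.0], [3.0, 3.0]]]
    large = assemble_large_image(toy_grid, patch_size=4)
    print(len(large), "x", len(large[0]))
```

In this sketch the grid of embeddings is given directly. In the setting the abstract describes, it would come either from an SSL encoder applied to a reference image or from an auxiliary model conditioned on another modality such as text.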
