We present the first diffusion-based framework that can learn an unknown distribution using only highly corrupted samples. This problem arises in scientific applications where uncorrupted samples are impossible or expensive to acquire. A further benefit of our approach is that the resulting generative models are less likely to memorize individual training samples, since they never observe clean training data. Our main idea is to introduce additional measurement distortion during the diffusion process and to require the model to predict the original corrupted image from the further-corrupted image. We prove that our method leads to models that learn the conditional expectation of the full uncorrupted image given this additional measurement corruption. This holds for any corruption process that satisfies some technical conditions, which cover inpainting and compressed sensing in particular. We train models on standard benchmarks (CelebA, CIFAR-10, and AFHQ) and show that we can learn the distribution even when all training samples have $90\%$ of their pixels missing. We also show that we can fine-tune foundation models on small corrupted datasets (e.g., MRI scans with block corruptions) and learn the clean distribution without memorizing the training set.
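The further-corruption idea can be illustrated for the inpainting case with a minimal sketch. It is not the authors' implementation: the function names, the flattened-list image representation, the independent per-pixel extra drop probability, and the choice to score the prediction only on pixels observed in the dataset are all assumptions made for illustration, and the diffusion noise itself is omitted.

```python
import random

def further_corrupt(mask, extra_drop_prob, rng):
    """Hide every pixel that `mask` hides, plus a random extra subset (assumed scheme)."""
    return [m if rng.random() >= extra_drop_prob else 0 for m in mask]

def training_example(image, mask, extra_drop_prob, rng):
    """Build the model input and the further mask from one corrupted sample."""
    further_mask = further_corrupt(mask, extra_drop_prob, rng)
    model_input = [x * m for x, m in zip(image, further_mask)]
    return model_input, further_mask

def masked_loss(prediction, image, mask):
    """Mean squared error restricted to pixels the dataset actually observed."""
    observed = [i for i, m in enumerate(mask) if m == 1]
    if not observed:
        return 0.0
    return sum((prediction[i] - image[i]) ** 2 for i in observed) / len(observed)

rng = random.Random(0)
image = [rng.random() for _ in range(16)]                   # stand-in for a flattened image
mask = [1 if rng.random() < 0.5 else 0 for _ in range(16)]  # pixels present in the dataset
model_input, further_mask = training_example(image, mask, 0.3, rng)

# Hypothetical placeholder "model" that echoes its input; a real one would be a network.
prediction = model_input
loss = masked_loss(prediction, image, mask)
```

Because the model cannot tell which hidden pixels were missing in the dataset and which were hidden only by the extra step, a loss of this kind pushes it to fill in all hidden pixels rather than copy its input.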