Diffusion models have come to dominate the field of large generative image models, with prime examples such as Stable Diffusion and DALL-E 3 being widely adopted. These models are trained to perform text-conditioned generation on vast numbers of image-caption pairs and, as a byproduct, have acquired general knowledge about natural image statistics. However, when confronted with the task of constrained sampling, e.g. generating the right half of an image conditioned on the known left half, applying these models is a delicate and slow process, with previously proposed algorithms relying on expensive iterative operations that are usually orders of magnitude slower than text-based inference. This is counter-intuitive, as image-conditioned generation should rely less on the difficult-to-learn semantic knowledge that links captions to imagery, and should instead be achievable through lower-level correlations among image pixels. In practice, inverse models are trained or tuned separately for each inverse problem, e.g. by providing parts of images during training as an additional condition, to enable their application in realistic settings. We argue that this is unnecessary and propose an algorithm for fast constrained sampling in large pre-trained diffusion models (Stable Diffusion) that requires no expensive backpropagation operations through the model and produces results comparable even to those of state-of-the-art \emph{tuned} models. Our method is based on a novel optimization perspective on sampling under constraints and employs a numerical approximation to the expensive gradients, previously computed via backpropagation, yielding significant speed-ups.
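To make the contrast concrete, the sketch below illustrates the general idea of backpropagation-free constrained sampling on a toy problem. A linear shrinkage function stands in for the diffusion model's denoiser (it is not Stable Diffusion, and this is not the paper's actual algorithm), the constraint fixes the left half of a sample, and the gradient of the constraint loss through the denoiser is estimated by central finite differences rather than autodiff. All function names and step sizes here are illustrative assumptions.

```python
import numpy as np

# Toy "denoiser": linear shrinkage toward zero stands in for the diffusion
# model's denoising network (an illustrative stand-in, not a real model).
def denoise(x, t):
    return (1.0 - t) * x

# Constraint: the left half of the denoised sample must match known values.
def constraint_loss(x, y_left):
    half = len(x) // 2
    return 0.5 * np.sum((x[:half] - y_left) ** 2)

# Backpropagation-free gradient estimate: central finite differences through
# the composition denoise -> constraint_loss, avoiding autodiff entirely.
# (Coordinate-wise differences cost O(d) model calls; a practical method
# would use a cheaper numerical approximation, but the principle is the same.)
def fd_gradient(x, t, y_left, eps=1e-4):
    grad = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = eps
        f_plus = constraint_loss(denoise(x + e, t), y_left)
        f_minus = constraint_loss(denoise(x - e, t), y_left)
        grad[i] = (f_plus - f_minus) / (2 * eps)
    return grad

# Guided sampling loop: start from noise and repeatedly step against the
# estimated gradient so the left half is driven toward the known values,
# while the unconstrained right half is left untouched.
rng = np.random.default_rng(0)
x = rng.standard_normal(8)
y_left = np.ones(4)
for t in np.linspace(0.9, 0.1, 20):
    x = x - 0.5 * fd_gradient(x, t, y_left)

print(np.round(denoise(x, 0.1)[:4], 2))
```

Because the toy loss is quadratic, the finite-difference estimate here matches the analytic gradient almost exactly; the point of the sketch is only that constraint guidance can be computed from forward evaluations of the model alone, which is what removes the cost of backpropagation.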