The quality and performance of text-to-image generation have advanced significantly in recent years, driven by the impressive results of diffusion models. However, text-to-image diffusion models still often fail to generate content that is faithful to the input prompt. One problem on which they struggle is generating the exact number of objects specified in the text prompt. For example, given the prompt "five apples and ten lemons on a table", diffusion-generated images usually contain the wrong number of objects. In this paper, we propose a method that steers diffusion models toward producing the object count specified in the input prompt. We adopt a counting network that performs reference-less, class-agnostic counting on any given image. At each denoising step, we compute the gradients of the counting network and use them to refine the predicted noise. To handle multiple types of objects in the prompt, we use a novel attention map guidance to obtain a high-fidelity mask for each object. Finally, we guide the denoising process with the gradients calculated for each object. Through extensive experiments and evaluation, we demonstrate that our proposed guidance method greatly improves the fidelity of diffusion models to object count.
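As a rough illustration of the guidance step described above, the update can be sketched in the style of classifier guidance. The abstract gives no formulas, so the following are assumptions made only for this sketch: the symbols, the squared-error counting loss, the guidance scale $s$, and the use of the predicted clean image $\hat{x}_0$ as input to the counting network $C$. Here $\epsilon_\theta$ is the predicted noise, $\bar{\alpha}_t$ the cumulative noise schedule, $M_i$ the attention-derived mask for object type $i$, and $N_i$ the count requested in the prompt.

```latex
% Assumed notation, not taken from the paper:
% predicted clean image at step t
\hat{x}_0 = \frac{x_t - \sqrt{1-\bar{\alpha}_t}\,\epsilon_\theta(x_t, t, c)}{\sqrt{\bar{\alpha}_t}}

% hypothetical per-object counting loss, summed over object types i
\mathcal{L}_{\text{count}} = \sum_i \bigl( C(M_i \odot \hat{x}_0) - N_i \bigr)^2

% refined noise prediction used for the denoising step
\hat{\epsilon}_t = \epsilon_\theta(x_t, t, c) + s\,\sqrt{1-\bar{\alpha}_t}\;\nabla_{x_t}\mathcal{L}_{\text{count}}
```

Under these assumptions, adding the scaled loss gradient to the predicted noise shifts each denoising step toward images whose masked counts are closer to $N_i$. The loss and scaling actually used in the paper may differ.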