ImageRAGTurbo: Towards One-step Text-to-Image Generation with Retrieval-Augmented Diffusion Models

Diffusion models have emerged as the leading approach for text-to-image generation. However, their iterative sampling process, which gradually transforms random noise into a coherent image, introduces significant latency that limits their applicability. Recent few-step diffusion models reduce sampling to between one and four steps, but they often compromise image quality and prompt alignment, especially in one-step generation. These models also require computationally expensive training procedures. To address these limitations, we propose ImageRAGTurbo, a novel approach to efficiently finetune few-step diffusion models via retrieval augmentation. Given a text prompt, we retrieve relevant text-image pairs from a database and use them to condition the generation process. We argue that such retrieved examples provide rich contextual information to the UNet denoiser, which helps reduce the number of denoising steps without compromising image quality. Indeed, our initial investigations show that using the retrieved content to edit the denoiser's latent space ($\mathcal{H}$-space), without any additional finetuning, already improves prompt fidelity. To further improve the quality of the generated images, we augment the UNet denoiser with a trainable adapter in the $\mathcal{H}$-space, which efficiently blends the retrieved content with the target prompt using a cross-attention mechanism. Experimental results on fast text-to-image generation demonstrate that our approach produces high-fidelity images without compromising latency compared to existing methods.
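The two ingredients described above, retrieving related entries for a prompt and blending them into the denoiser's bottleneck features with cross-attention, can be illustrated with a toy sketch. This is not the authors' implementation: the function names (`retrieve_top_k`, `cross_attention_blend`), the scalar `gate`, the tiny hand-written embeddings, and the use of identity projections in place of learned query/key/value weights are all assumptions made for illustration, since the abstract does not specify these details.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve_top_k(query_emb, database, k):
    """Return the k entries whose text embedding is closest to the query.

    `database` is a hypothetical list of dicts with keys
    "caption", "text_emb", and "image_emb".
    """
    ranked = sorted(database,
                    key=lambda e: cosine(query_emb, e["text_emb"]),
                    reverse=True)
    return ranked[:k]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention_blend(h, retrieved, gate):
    """Residual single-head cross-attention over retrieved vectors.

    h:         list of d-dimensional bottleneck vectors (used as queries)
    retrieved: list of d-dimensional vectors (used as both keys and values;
               a trained adapter would apply learned projections instead)
    gate:      scalar weight on the attended context (assumed, not from the paper)
    """
    d = len(h[0])
    out = []
    for q in h:
        # Scaled dot-product score of this query against every retrieved vector.
        scores = [sum(qi * ki for qi, ki in zip(q, kv)) / math.sqrt(d)
                  for kv in retrieved]
        weights = softmax(scores)
        # Attention-weighted average of the retrieved vectors.
        context = [sum(w * kv[i] for w, kv in zip(weights, retrieved))
                   for i in range(d)]
        # Residual connection: original feature plus gated context.
        out.append([qi + gate * ci for qi, ci in zip(q, context)])
    return out

# Toy usage with made-up 2-D embeddings.
database = [
    {"caption": "a red apple",  "text_emb": [1.0, 0.0], "image_emb": [1.0, 2.0]},
    {"caption": "a blue car",   "text_emb": [0.0, 1.0], "image_emb": [3.0, 1.0]},
    {"caption": "a green apple", "text_emb": [0.9, 0.1], "image_emb": [0.5, 2.5]},
]
neighbors = retrieve_top_k([1.0, 0.0], database, k=2)
h_space = [[0.5, 0.25], [0.1, 0.9]]
blended = cross_attention_blend(h_space, [n["image_emb"] for n in neighbors], gate=0.5)
```

With `gate=0.0` the features pass through unchanged, so in this sketch the gate controls how strongly the retrieved content influences the bottleneck features. Whether the actual adapter uses such a gate is not stated in the abstract.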
