The text-to-image diffusion model is a popular paradigm that synthesizes personalized images from a text prompt and a random Gaussian noise. Although it has been observed that some noises are ``golden noises'' that achieve better text-image alignment and higher human preference than others, we still lack a machine learning framework to obtain them. To learn golden noises for diffusion sampling, we make three main contributions in this paper. First, we identify a new concept termed the \textit{noise prompt}, which aims to turn a random Gaussian noise into a golden noise by adding a small desirable perturbation derived from the text prompt. Building on this concept, we formulate the \textit{noise prompt learning} framework, which systematically learns the ``prompted'' golden noise associated with a text prompt for diffusion models. Second, we design a noise prompt data collection pipeline and collect a large-scale \textit{noise prompt dataset}~(NPD) that contains 100k pairs of random noises and golden noises with the associated text prompts. With NPD as the training dataset, we train a small \textit{noise prompt network}~(NPNet) that directly learns to transform a random noise into a golden noise. The learned golden noise perturbation can be regarded as a kind of prompt for noise, as it is rich in semantic information and tailored to the given text prompt. Third, our extensive experiments demonstrate the strong effectiveness and generalization of NPNet in improving the quality of synthesized images across various diffusion models, including SDXL, DreamShaper-xl-v2-turbo, and Hunyuan-DiT. Moreover, NPNet is a small and efficient controller that acts as a plug-and-play module with very limited additional inference and computational cost, since it merely replaces the random noise with a golden noise without modifying the original pipeline.
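As a minimal sketch of the plug-and-play idea, the abstract can be read as follows: a small network consumes a random Gaussian noise together with a text-prompt embedding, predicts a prompt-conditioned perturbation, and adds it to the noise before sampling begins. The architecture, dimensions, and names below (`NPNetSketch`, a flat MLP over flattened noise) are illustrative assumptions, not the paper's actual design.

```python
import torch
import torch.nn as nn

class NPNetSketch(nn.Module):
    """Hypothetical noise prompt network (illustrative assumption).

    Maps (random noise, text embedding) to a small residual perturbation,
    so that golden_noise = noise + delta. The real NPNet's architecture
    may differ; this only demonstrates the input/output contract.
    """

    def __init__(self, noise_dim: int, text_dim: int, hidden: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(noise_dim + text_dim, hidden),
            nn.SiLU(),
            nn.Linear(hidden, noise_dim),
        )

    def forward(self, noise: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        # Predict a prompt-conditioned perturbation and add it to the noise,
        # keeping the output the same shape as the sampler's expected input.
        delta = self.mlp(torch.cat([noise, text_emb], dim=-1))
        return noise + delta

# Plug-and-play usage: only the initial noise handed to the sampler changes;
# the diffusion pipeline itself is untouched.
npnet = NPNetSketch(noise_dim=16, text_dim=8)
noise = torch.randn(2, 16)       # stand-in for flattened latent noise
text_emb = torch.randn(2, 8)     # stand-in for a text-prompt embedding
golden = npnet(noise, text_emb)  # same shape as `noise`
```

Because the output shape matches the input noise, the golden noise can be passed to an existing sampler wherever the random noise would have gone, which is what makes the module plug-and-play.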