Text-to-image diffusion models generate images by gradually converting white Gaussian noise into a natural image. Because it lacks structure, white Gaussian noise is well suited for producing diverse outputs from a single text prompt. However, this very property limits control over, and predictability of, specific visual attributes, as the noise is not human-interpretable. In this work, we investigate the characteristics of the input noise in diffusion models. We show that, although all frequencies in white Gaussian noise have comparable statistical energy, low-frequency components primarily determine the image's global structure and color composition, while high-frequency components control finer details. Building on this observation, we demonstrate that simple manipulations of the low-frequency noise using low-frequency image priors can effectively condition the generation process to reconstruct these low-frequency visual cues. This allows us to define a simple, training-free method with minimal overhead that steers overall image structure and color, while letting high-frequency components emerge freely as fine details, enabling variability across generated outputs.
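To make the idea of manipulating only the low-frequency part of the noise concrete, the following is a minimal NumPy sketch of one plausible form such a manipulation could take: swapping the low-frequency Fourier band of a noise sample for that of a prior. The abstract does not specify the actual operation, so the function names (`low_pass_mask`, `blend_low_frequencies`), the circular mask, the `cutoff` value, and the assumption that the prior already lives in the same space and scale as the noise are all illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def low_pass_mask(height, width, cutoff):
    """Boolean mask over 2D FFT bins whose radial frequency is <= cutoff.

    Frequencies are in cycles per pixel, so they lie in [-0.5, 0.5).
    The circular shape and the cutoff are illustrative assumptions.
    """
    fy = np.fft.fftfreq(height)[:, None]
    fx = np.fft.fftfreq(width)[None, :]
    return np.sqrt(fy**2 + fx**2) <= cutoff


def blend_low_frequencies(noise, prior, cutoff=0.1):
    """Take the low-frequency band from `prior` and the rest from `noise`.

    Both arrays must share a shape of (..., H, W). The prior is assumed
    (hypothetically) to be scaled comparably to the noise already.
    """
    mask = low_pass_mask(noise.shape[-2], noise.shape[-1], cutoff)
    noise_f = np.fft.fft2(noise, axes=(-2, -1))
    prior_f = np.fft.fft2(prior, axes=(-2, -1))
    # Inside the mask use the prior's spectrum, outside keep the noise's.
    mixed_f = np.where(mask, prior_f, noise_f)
    # The mask is symmetric under frequency negation, so the result is
    # real up to floating-point error; discard the residual imaginary part.
    return np.fft.ifft2(mixed_f, axes=(-2, -1)).real


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    noise = rng.standard_normal((4, 64, 64))   # stand-in for initial noise
    prior = rng.standard_normal((4, 64, 64))   # stand-in for an image prior
    mixed = blend_low_frequencies(noise, prior, cutoff=0.1)
    print(mixed.shape)
```

In this sketch the high-frequency band is left untouched, so different noise seeds still differ in their fine-scale content while sharing the low-frequency band taken from the prior.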