Text-to-image diffusion models generate images by gradually converting white Gaussian noise into a natural image. Because it has no structure, white Gaussian noise is well suited to producing diverse outputs from a single text prompt. However, this same property limits control over, and predictability of, specific visual attributes, since the noise is not human-interpretable. In this work, we investigate the characteristics of the input noise in diffusion models. We show that, although all frequencies in white Gaussian noise have comparable statistical energy, the low-frequency components primarily determine the image's global structure and color composition, while the high-frequency components control finer details. Building on this observation, we demonstrate that simple manipulations of the low-frequency noise using low-frequency image priors can effectively condition the generation process to reconstruct these low-frequency visual cues. This allows us to define a simple, training-free method with minimal overhead that steers overall image structure and color while letting high-frequency components emerge freely as fine details, preserving variability across generated outputs.
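The idea of manipulating only the low-frequency part of the input noise can be illustrated with a small NumPy sketch. The abstract does not specify the exact operation, so this is only one possible reading under stated assumptions: a hard radial cutoff in the 2D Fourier domain, a single-channel noise map, and a prior that has already been scaled to noise-like magnitude. The names `low_frequency_mask`, `blend_low_frequencies`, and `cutoff` are hypothetical and do not come from the paper.

```python
import numpy as np


def low_frequency_mask(height, width, cutoff):
    """Boolean mask over FFT bins whose radial frequency (cycles/pixel) is <= cutoff."""
    fy = np.fft.fftfreq(height)[:, None]  # vertical frequencies, shape (H, 1)
    fx = np.fft.fftfreq(width)[None, :]   # horizontal frequencies, shape (1, W)
    return np.sqrt(fx ** 2 + fy ** 2) <= cutoff


def blend_low_frequencies(noise, prior, cutoff=0.05):
    """Swap the low-frequency band of `noise` for that of `prior`.

    Assumption: `prior` is a low-frequency image cue with the same shape as
    `noise`, already normalized to a comparable scale. High-frequency bins
    are left untouched, so they remain random.
    """
    mask = low_frequency_mask(noise.shape[0], noise.shape[1], cutoff)
    noise_f = np.fft.fft2(noise)
    prior_f = np.fft.fft2(prior)
    mixed_f = np.where(mask, prior_f, noise_f)
    # The mask is symmetric in frequency, so the mixed spectrum of two real
    # inputs is still Hermitian; np.real only drops floating-point residue.
    return np.real(np.fft.ifft2(mixed_f))


# Usage: steer a 64x64 noise map toward a smooth horizontal wave.
rng = np.random.default_rng(0)
noise = rng.standard_normal((64, 64))
xx = np.arange(64)[None, :].repeat(64, axis=0)
prior = np.sin(2 * np.pi * xx / 64)
steered = blend_low_frequencies(noise, prior, cutoff=0.05)
```

In this sketch the result would stand in for the initial noise passed to the sampler. Drawing a fresh `noise` for each sample keeps the high-frequency band random, which is the source of output variability described above, while the shared low-frequency band carries the structure and color cue.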