Text-to-image synthesis (T2I) has advanced remarkably with the emergence of large-scale diffusion models. In the conventional setup, the text prompt provides explicit, user-defined guidance, directing the generation process by denoising randomly sampled Gaussian noise. In this work, we reveal that the often-overlooked noise itself encodes inherent generative tendencies, acting as a "silent prompt" that implicitly guides the output. This implicit guidance, rooted in the noise-scheduler design of diffusion model formulations and their training stages, generalizes across a wide range of T2I models and backbones. Building on this insight, we introduce NoiseQuery, a novel strategy that selects optimal initial noise from a pre-built noise library to meet diverse user needs. Our approach not only enhances high-level semantic alignment with text prompts, but also allows for nuanced adjustment of low-level visual attributes, such as texture, sharpness, shape, and color, which are typically challenging to control through text alone. Extensive experiments across various models and target attributes demonstrate the strong performance and zero-shot transferability of our approach, which requires no additional optimization.
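The library-and-select workflow described above can be sketched as follows. This is a minimal illustrative example, not the paper's implementation: the scoring function below (a simple per-noise statistic standing in for a cheap attribute preview) and all function names are hypothetical assumptions introduced only to show the query structure.

```python
import numpy as np

rng = np.random.default_rng(0)

def build_noise_library(n_seeds, shape=(4, 8, 8)):
    """Pre-sample a library of latent-shaped initial Gaussian noises."""
    return [rng.standard_normal(shape) for _ in range(n_seeds)]

def preview_score(noise, target="bright"):
    """Hypothetical cheap scorer: approximate a noise's generative
    tendency toward a target attribute with a simple statistic,
    standing in for scoring a quick denoised preview."""
    stat = float(noise.mean())  # stand-in for a measured low-level attribute
    return stat if target == "bright" else -stat

def query_noise(library, target="bright"):
    """Select the library noise whose inherent tendency best matches
    the requested target attribute."""
    scores = [preview_score(z, target) for z in library]
    return library[int(np.argmax(scores))]

library = build_noise_library(n_seeds=32)
best = query_noise(library, target="bright")
```

The selected `best` tensor would then replace the usual freshly sampled noise as the starting point of the diffusion sampler; because selection happens before sampling, no extra optimization or model changes are needed.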