Text-to-image (T2I) diffusion models have demonstrated impressive performance in generating high-fidelity images, largely enabled by text-guided inference. However, this advantage often comes with a critical drawback: limited diversity, as outputs tend to collapse into similar modes under strong text guidance. Existing approaches typically optimize intermediate latents or text conditions during inference, but these methods deliver only modest gains or remain sensitive to hyperparameter tuning. In this work, we introduce Contrastive Noise Optimization, a simple yet effective method that addresses the diversity issue from a distinct perspective. Unlike prior techniques that adapt intermediate latents, our approach shapes the initial noise to promote diverse outputs. Specifically, we develop a contrastive loss defined in the Tweedie data space and optimize a batch of noise latents. Our contrastive optimization repels instances within the batch to maximize diversity while keeping them anchored to a reference sample to preserve fidelity. We further provide theoretical insights into the mechanism of this preprocessing to substantiate its effectiveness. Extensive experiments across multiple T2I backbones demonstrate that our approach achieves a superior quality-diversity Pareto frontier while remaining robust to hyperparameter choices.
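The abstract does not spell out the exact loss, so the following is only an illustrative sketch of the idea it describes: optimize a batch of initial noise latents by a gradient method so that their Tweedie (posterior-mean) estimates repel one another while staying anchored to a reference sample. The function names (`tweedie_estimate`, `contrastive_noise_loss`, `optimize_noise`), the exponential-kernel repulsion term, the L2 anchor, and the toy ε-prediction denoiser are all assumptions for the sketch, not the paper's actual formulation.

```python
import torch

def tweedie_estimate(z, denoiser, abar=0.9):
    # Tweedie / posterior-mean estimate of the clean data from a noisy
    # latent z, given an eps-prediction network (abar is a placeholder
    # cumulative noise schedule value, not taken from the paper):
    #   x0_hat = (z - sqrt(1 - abar) * eps_hat) / sqrt(abar)
    eps_hat = denoiser(z)
    return (z - (1.0 - abar) ** 0.5 * eps_hat) / abar ** 0.5

def contrastive_noise_loss(x0, x0_ref, tau=0.5, lam=1.0):
    # Repel instances within the batch (diversity) while anchoring them
    # to a reference sample (fidelity). tau and lam are hypothetical
    # temperature / anchor-weight hyperparameters.
    b = x0.shape[0]
    flat = x0.reshape(b, -1)
    d2 = torch.cdist(flat, flat).pow(2)          # pairwise squared distances
    off_diag = ~torch.eye(b, dtype=torch.bool)   # exclude self-pairs
    repel = torch.exp(-d2[off_diag] / tau).mean()  # large when samples collapse
    anchor = (flat - x0_ref.reshape(1, -1)).pow(2).mean()
    return repel + lam * anchor

def optimize_noise(z0, denoiser, x0_ref, steps=50, lr=0.05):
    # Preprocess the initial noise batch before sampling: gradient steps
    # on the noise latents themselves, in the Tweedie data space.
    z = z0.clone().requires_grad_(True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = contrastive_noise_loss(tweedie_estimate(z, denoiser), x0_ref)
        loss.backward()
        opt.step()
    return z.detach()
```

A usage sketch with a trivial stand-in denoiser: `z = optimize_noise(torch.randn(4, 8), lambda z: torch.zeros_like(z), x0_ref)` returns a shaped noise batch that can then be fed to the usual sampler in place of i.i.d. Gaussian noise.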