The success of text-guided diffusion models has established a new image generation paradigm driven by the iterative refinement of text prompts. However, modifying the original text prompt to achieve a desired semantic adjustment often causes unintended changes to the global structure, which run counter to user intent. Existing methods intervene on empirically selected feature maps, so their performance depends heavily on that selection and their stability is suboptimal. This paper addresses the problem from a frequency perspective, analyzing how the frequency spectrum of the noisy latent variables affects the hierarchical emergence of the structural framework and fine-grained textures during generation. We find that lower-frequency components are primarily responsible for establishing the structural framework in the early generation stage. Their influence diminishes over time, giving way to higher-frequency components that synthesize fine-grained textures. In light of this, we propose a training-free frequency modulation method that uses a frequency-dependent weighting function with dynamic decay. The method keeps the structural framework consistent while permitting targeted semantic modifications. Because it manipulates the noisy latent variable directly, it avoids the empirical selection of internal feature maps. Extensive experiments demonstrate that the proposed method significantly outperforms current state-of-the-art methods, achieving an effective balance between preserving structure and enabling semantic updates.
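The idea of a frequency-dependent weighting function with dynamic decay can be illustrated with a minimal NumPy sketch. The abstract does not give the actual weighting function, so everything specific here is an assumption made for illustration: the Gaussian low-pass shape, the quadratic decay schedule, the `sigma` parameter, and the hypothetical function names `frequency_weight` and `modulate_latent`. The sketch blends the spectrum of a source noisy latent (from the original prompt) with that of an edited noisy latent (from the modified prompt), favoring the source at low frequencies early in generation and letting that preference fade over time.

```python
import numpy as np


def radial_frequency(height, width):
    """Radial frequency of each 2D FFT bin, normalized to [0, 1]."""
    fy = np.fft.fftfreq(height)[:, None]  # cycles per sample along rows
    fx = np.fft.fftfreq(width)[None, :]   # cycles per sample along columns
    radius = np.sqrt(fy ** 2 + fx ** 2)
    return radius / radius.max()


def frequency_weight(height, width, progress, sigma=0.25):
    """Assumed weighting: Gaussian low-pass scaled by a time decay.

    progress is 0.0 at the first denoising step and 1.0 at the last.
    Both the Gaussian shape and the quadratic decay are illustrative
    choices, not the function used in the paper.
    """
    low_pass = np.exp(-(radial_frequency(height, width) ** 2) / (2 * sigma ** 2))
    decay = (1.0 - progress) ** 2
    return decay * low_pass


def modulate_latent(source_latent, edited_latent, progress, sigma=0.25):
    """Blend two noisy latents of shape (channels, height, width) in the frequency domain."""
    height, width = source_latent.shape[-2:]
    weight = frequency_weight(height, width, progress, sigma)
    source_spec = np.fft.fft2(source_latent, axes=(-2, -1))
    edited_spec = np.fft.fft2(edited_latent, axes=(-2, -1))
    # Take `weight` of the source spectrum and the remainder from the edit.
    mixed = weight * source_spec + (1.0 - weight) * edited_spec
    # The weight is symmetric in frequency, so the result is real up to rounding.
    return np.fft.ifft2(mixed, axes=(-2, -1)).real


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    source = rng.standard_normal((4, 64, 64))
    edited = rng.standard_normal((4, 64, 64))
    for progress in (0.0, 0.5, 1.0):
        out = modulate_latent(source, edited, progress)
        print(progress, float(np.abs(out - edited).mean()))
```

Under these assumptions, the output at `progress=1.0` is the edited latent unchanged, while at `progress=0.0` its lowest frequencies come almost entirely from the source. In a real diffusion pipeline the blend would presumably be applied to the noisy latent at each denoising step, but how and when that happens is not stated in the abstract.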