Text-to-image diffusion models excel at generating high-quality, diverse images from natural language prompts. However, they often fail to produce semantically accurate results when the prompt contains concept combinations that contradict their learned priors. We define this failure mode as contextual contradiction, where one concept implicitly negates another due to entangled associations learned during training. To address this, we propose a stage-aware prompt decomposition framework that guides the denoising process using a sequence of proxy prompts. Each proxy prompt is constructed to match the semantic content expected to emerge at a specific stage of denoising, while ensuring contextual coherence. To construct these proxy prompts, we leverage a large language model (LLM) to analyze the target prompt, identify contradictions, and generate alternative expressions that preserve the original intent while resolving contextual conflicts. By aligning prompt information with the denoising progression, our method enables fine-grained semantic control and accurate image generation in the presence of contextual contradictions. Experiments across a variety of challenging prompts show substantial improvements in alignment with the textual prompt.
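The stage-aware scheduling described in the abstract can be illustrated with a minimal Python sketch. It only shows how a sequence of proxy prompts might be assigned to denoising steps; it is not the authors' implementation. The names `ProxyStage`, `prompt_for_step`, and `build_schedule`, the use of fractional stage boundaries, and the example prompts are all assumptions made for illustration. The LLM step that produces the proxy prompts and the diffusion model itself are left out.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ProxyStage:
    """One proxy prompt and the point in denoising where its stage ends.

    Hypothetical structure: end_fraction is the fraction of denoising
    progress (0 = first step, 1 = done) at which this stage stops applying.
    """
    prompt: str
    end_fraction: float


def prompt_for_step(stages: List[ProxyStage], step: int, num_steps: int) -> str:
    """Return the proxy prompt that conditions the given denoising step."""
    if not stages:
        raise ValueError("at least one stage is required")
    if not 0 <= step < num_steps:
        raise ValueError("step must satisfy 0 <= step < num_steps")
    progress = step / num_steps
    for stage in stages:
        if progress < stage.end_fraction:
            return stage.prompt
    # Fall back to the last stage if the boundaries do not reach 1.0.
    return stages[-1].prompt


def build_schedule(stages: List[ProxyStage], num_steps: int) -> List[str]:
    """Expand the stages into one prompt per denoising step."""
    fractions = [s.end_fraction for s in stages]
    if fractions != sorted(fractions):
        raise ValueError("stage boundaries must be non-decreasing")
    return [prompt_for_step(stages, s, num_steps) for s in range(num_steps)]


if __name__ == "__main__":
    # Made-up proxy prompts for an illustrative contradictory target prompt.
    stages = [
        ProxyStage("a large white animal standing on sand dunes", 0.25),
        ProxyStage("a white bear standing on sand dunes", 0.5),
        ProxyStage("a polar bear standing in a desert", 1.0),
    ]
    for step, prompt in enumerate(build_schedule(stages, num_steps=8)):
        print(step, prompt)
```

In a real pipeline, the per-step prompt would be encoded and passed as the text conditioning for that step. Where the stage boundaries should fall, and how many stages are useful, are open choices this sketch does not settle.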