Diffusion models have demonstrated impressive capability in text-to-image generation. Recent methods add image-level structure controls, e.g., edge and depth maps, to steer the generation process together with text prompts and obtain desired images. However, this control is applied globally over the entire image, which limits the flexibility of control regions. In this paper, we explore a novel and practical task setting: local control. It controls specific local regions according to user-defined image conditions, while the remaining regions are conditioned only on the original text prompt. Achieving local conditional control is non-trivial: naively adding local conditions can lead to a local control dominance problem, which forces the model to focus on the controlled region and neglect object generation elsewhere. To mitigate this problem, we propose a Regional Discriminate Loss that updates the noised latents to enhance object generation in non-control regions. Furthermore, the proposed Focused Token Response suppresses weaker attention scores that lack the strongest response, enhancing object distinction and reducing duplication. Lastly, we adopt a Feature Mask Constraint to reduce image quality degradation caused by information differences across the local control region. All proposed strategies operate at the inference stage. Extensive experiments demonstrate that our method synthesizes high-quality images aligned with the text prompt under local control conditions.
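To make the Focused Token Response idea concrete, the following is a minimal sketch (not the authors' implementation) of one plausible reading: for each spatial position in a cross-attention map, keep only the object token with the strongest response and scale down the others. The function name, the `suppress` factor, and the use of NumPy arrays are illustrative assumptions.

```python
import numpy as np

def focused_token_response(attn, object_token_ids, suppress=0.0):
    """Illustrative sketch: attn has shape (n_positions, n_tokens).
    For each spatial position, keep only the object token with the
    strongest attention response; weaker object-token responses are
    scaled by `suppress`, which can reduce duplicated objects."""
    attn = attn.copy()
    obj = attn[:, object_token_ids]              # (n_positions, n_obj_tokens)
    winner = obj.argmax(axis=1)                  # strongest object token per position
    keep = np.zeros_like(obj, dtype=bool)
    keep[np.arange(obj.shape[0]), winner] = True
    obj[~keep] *= suppress                       # suppress non-maximal responses
    attn[:, object_token_ids] = obj
    return attn
```

In practice such a modification would be applied to the cross-attention maps inside the denoising U-Net at inference time, consistent with the paper's claim that all strategies operate without retraining.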