Text-to-image diffusion models have advanced toward more controllable generation by supporting image conditions (e.g., depth maps) in addition to text. However, these models are trained on the premise that the text and image conditions are perfectly aligned. When this alignment does not hold, the output may be dominated by one condition or become ambiguous, failing to meet user expectations. To address this issue, we present a training-free approach called "Decompose and Realign" that further improves the controllability of existing models when they are given partially aligned conditions. The "Decompose" phase separates conditions based on pair relationships and computes scores individually for each pair, so that no pair contains conflicting conditions. The "Realign" phase aligns these independently computed scores via a cross-attention mechanism to avoid introducing new conflicts when combining them. Both qualitative and quantitative results demonstrate the effectiveness of our approach in handling unaligned conditions: it performs favorably against recent methods and, more importantly, adds flexibility to the controllable image generation process.
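The abstract does not state how the per-pair scores are combined, so the following is only a minimal sketch of one plausible form: the standard composed classifier-free guidance rule, in which each pair's score contributes a weighted offset from an unconditional score. The function name `combine_pair_scores`, the flat-list representation of a score, and the per-pair weights are all assumptions made for illustration. The cross-attention-based "Realign" step is not modeled here.

```python
from typing import List, Sequence


def combine_pair_scores(
    uncond: Sequence[float],
    pair_scores: Sequence[Sequence[float]],
    weights: Sequence[float],
) -> List[float]:
    """Combine independently computed per-pair scores (hypothetical helper).

    Assumed rule (composed classifier-free guidance, not confirmed by the
    abstract): combined = uncond + sum_i w_i * (score_i - uncond).
    Each score is a flattened noise prediction of the same length.
    """
    if len(pair_scores) != len(weights):
        raise ValueError("need exactly one weight per condition pair")

    combined = list(uncond)
    for score, weight in zip(pair_scores, weights):
        if len(score) != len(uncond):
            raise ValueError("all scores must have the same length")
        # Add this pair's guidance direction, scaled by its weight.
        for k, (s, u) in enumerate(zip(score, uncond)):
            combined[k] += weight * (s - u)
    return combined


# Toy example: two condition pairs acting on a 2-element "score".
result = combine_pair_scores(
    uncond=[0.0, 1.0],
    pair_scores=[[1.0, 1.0], [0.0, 3.0]],
    weights=[2.0, 0.5],
)
```

Because each pair is scored on its own, a conflict between, say, a depth map and an unrelated part of the prompt never enters a single forward pass. Under this assumed rule, the weights only control how strongly each pair steers the final result.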