Existing text-to-image (T2I) diffusion models usually struggle to interpret complex prompts, especially those involving quantities, object-attribute binding, and multi-subject descriptions. In this work, we introduce a semantic panel as middleware in decoding text to images, helping the generator better follow instructions. The panel is obtained by arranging the visual concepts parsed from the input text with the aid of large language models, and is then injected into the denoising network as a detailed control signal that complements the text condition. To facilitate text-to-panel learning, we design a semantic formatting protocol, accompanied by a fully automatic data preparation pipeline. With this design, our approach, which we call Ranni, enhances the textual controllability of a pre-trained T2I generator. More importantly, the generative middleware enables a more convenient form of interaction (i.e., directly adjusting the elements in the panel or using language instructions) and allows users to finely customize their generation. Building on this, we develop a practical system and showcase its potential in continuous generation and chat-based editing.
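To make the idea of a panel as editable middleware concrete, the following is a minimal sketch of one possible in-memory representation. The abstract does not specify the panel format, so the names `PanelElement` and `SemanticPanel`, their fields, and the normalized bounding-box convention are all assumptions made for illustration; they are not Ranni's actual formatting protocol.

```python
from dataclasses import dataclass, field, replace
from typing import List, Tuple


@dataclass(frozen=True)
class PanelElement:
    """One visual concept placed on the panel (hypothetical schema)."""
    description: str                       # parsed concept, e.g. "apple"
    attributes: Tuple[str, ...] = ()       # bound attributes, e.g. ("red",)
    # Assumed layout convention: (x0, y0, x1, y1), normalized to [0, 1].
    box: Tuple[float, float, float, float] = (0.0, 0.0, 1.0, 1.0)


@dataclass
class SemanticPanel:
    """An ordered collection of elements standing between text and image."""
    elements: List[PanelElement] = field(default_factory=list)

    def count(self, description: str) -> int:
        # Quantity is explicit: one element per object instance.
        return sum(1 for e in self.elements if e.description == description)

    def move(self, index: int, dx: float, dy: float) -> None:
        # Direct manipulation: shift one element's box by (dx, dy).
        e = self.elements[index]
        x0, y0, x1, y1 = e.box
        self.elements[index] = replace(
            e, box=(x0 + dx, y0 + dy, x1 + dx, y1 + dy)
        )


# Hypothetical panel for the prompt "two red apples next to a blue bowl".
panel = SemanticPanel([
    PanelElement("apple", ("red",), (0.125, 0.5, 0.375, 0.75)),
    PanelElement("apple", ("red",), (0.5, 0.5, 0.75, 0.75)),
    PanelElement("bowl", ("blue",), (0.75, 0.5, 1.0, 0.75)),
])
panel.move(0, dx=0.0, dy=-0.25)  # user drags the first apple upward
```

In this sketch, quantity and attribute binding are carried by the structure itself (one element per object, attributes attached to that element), which is the kind of information a plain text condition tends to lose. How such a panel would be encoded and injected into the denoising network is outside the scope of this illustration.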