Ranni: Taming Text-to-Image Diffusion for Accurate Instruction Following

Existing text-to-image (T2I) diffusion models usually struggle to interpret complex prompts, especially those involving quantity, object-attribute binding, and multi-subject descriptions. In this work, we introduce a semantic panel as middleware in decoding text to images, helping the generator follow instructions more accurately. The panel is obtained by arranging the visual concepts parsed from the input text with the aid of large language models, and is then injected into the denoising network as a detailed control signal that complements the text condition. To facilitate text-to-panel learning, we design a semantic formatting protocol, accompanied by a fully automatic data preparation pipeline. With this design, our approach, which we call Ranni, enhances the textual controllability of a pre-trained T2I generator. More importantly, the generative middleware enables a more convenient form of interaction (i.e., directly adjusting the elements in the panel or using language instructions) and allows users to finely customize their generation. Building on this, we develop a practical system and showcase its potential in continuous generation and chatting-based editing.
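To make the idea of a panel as editable middleware more concrete, the following is a minimal sketch in Python. It is not Ranni's actual format: the abstract does not specify the semantic formatting protocol, so the `PanelElement` schema (a description, a normalized bounding box, and free-form attributes) and the helpers `move` and `count` are hypothetical names chosen purely for illustration. The sketch only shows how a list of parsed visual concepts could be inspected (e.g., checking quantity) and adjusted directly before being passed to a generator.

```python
from dataclasses import dataclass, field, replace
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class PanelElement:
    """One visual concept in the panel (hypothetical schema, not Ranni's protocol)."""
    description: str
    # Bounding box as (x0, y0, x1, y1), normalized to the [0, 1] canvas.
    box: Tuple[float, float, float, float]
    # Free-form attribute bindings, e.g. {"color": "red"}.
    attributes: Dict[str, str] = field(default_factory=dict)


def move(element: PanelElement, dx: float, dy: float) -> PanelElement:
    """Return a copy of the element shifted by (dx, dy), clamped to the canvas."""
    x0, y0, x1, y1 = element.box
    # Limit the shift so the box keeps its size and stays inside [0, 1].
    dx = max(-x0, min(dx, 1.0 - x1))
    dy = max(-y0, min(dy, 1.0 - y1))
    return replace(element, box=(x0 + dx, y0 + dy, x1 + dx, y1 + dy))


def count(panel: List[PanelElement], description: str) -> int:
    """Count the elements with a given description (a simple quantity check)."""
    return sum(1 for e in panel if e.description == description)


# A hand-written panel for a prompt such as
# "a red apple and a green apple on a brown table" (assumed example).
panel = [
    PanelElement("apple", (0.1, 0.5, 0.3, 0.7), {"color": "red"}),
    PanelElement("apple", (0.4, 0.5, 0.6, 0.7), {"color": "green"}),
    PanelElement("table", (0.0, 0.7, 1.0, 1.0), {"color": "brown"}),
]

# Direct adjustment: shift the red apple to the right; the original panel is unchanged.
edited = [move(panel[0], 0.5, 0.0)] + panel[1:]
```

In this sketch, an edit produces a new panel rather than mutating the old one, which would make it straightforward to keep a history of panels across successive editing turns. How the panel is actually encoded and injected into the denoising network is outside the scope of this illustration.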

