Teaching Text-to-Image Models to Communicate

Text-to-image generation has been studied extensively. Although existing models perform well on that task, applying them directly to generate images in dialogs poses significant challenges. In this paper, we first highlight a new problem, dialog-to-image generation: given a dialog context, the model should generate a realistic image that is consistent with the conversation as its response. To tackle this problem, we propose an efficient approach to dialog-to-image generation that requires no intermediate translation and maximizes the extraction of the semantic information contained in the dialog. Considering the structure of dialogs, we place a segment token before each sentence in a dialog turn to distinguish the speakers. We then fine-tune pre-trained text-to-image models so that they can generate images conditioned on the processed dialog context. After fine-tuning, our approach consistently improves the performance of various models across multiple metrics. Experimental results on a public benchmark demonstrate the effectiveness and practicality of our method.
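The preprocessing step described in the abstract, a segment token placed before each sentence of a turn, can be illustrated with a minimal sketch. The token strings `[SPK_A]` and `[SPK_B]`, the two-speaker setup, the function names, and the naive punctuation-based sentence splitter are all assumptions made for illustration; the abstract does not specify the actual tokens or how sentences are split.

```python
import re
from typing import List, Tuple

# Hypothetical segment tokens, one per speaker. The abstract does not name
# the real tokens, so these strings are placeholders.
SPEAKER_TOKENS = {"A": "[SPK_A]", "B": "[SPK_B]"}


def split_sentences(turn: str) -> List[str]:
    """Naive splitter (an assumption): break after '.', '!' or '?'."""
    parts = re.split(r"(?<=[.!?])\s+", turn.strip())
    return [p for p in parts if p]


def flatten_dialog(turns: List[Tuple[str, str]]) -> str:
    """Prefix every sentence with its speaker's segment token and join
    the result into a single string usable as a text prompt."""
    pieces = []
    for speaker, text in turns:
        token = SPEAKER_TOKENS[speaker]
        for sentence in split_sentences(text):
            pieces.append(f"{token} {sentence}")
    return " ".join(pieces)


dialog = [
    ("A", "I got a puppy. Want to see?"),
    ("B", "Yes please!"),
]
prompt = flatten_dialog(dialog)
print(prompt)
```

In a fine-tuning setup, a string like `prompt` would take the place of the caption normally fed to the text encoder of a pre-trained text-to-image model. New tokens would typically also need to be registered with the model's tokenizer, which this sketch leaves out.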

