Text-to-image generation has been studied extensively. Although existing models perform well on that task, directly applying them to generate images in dialogs poses significant challenges. In this paper, we introduce a new problem, dialog-to-image generation: given a dialog context, the model should respond with a realistic image that is consistent with the conversation. To tackle this problem, we propose an efficient approach for dialog-to-image generation that requires no intermediate translation and maximizes the extraction of the semantic information contained in the dialog. To account for dialog structure, we insert a segment token before each sentence in a dialog turn to distinguish the speakers. We then fine-tune pre-trained text-to-image models so that they generate images conditioned on the processed dialog context. After fine-tuning, our approach consistently improves the performance of various models across multiple metrics. Experimental results on a public benchmark demonstrate the effectiveness and practicality of our method.
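The segment-token preprocessing described in the abstract can be illustrated with a minimal sketch. The abstract does not name the tokens or the sentence-splitting rule, so the token strings `[SPK0]` and `[SPK1]`, the helper names `split_sentences` and `flatten_dialog`, and the naive punctuation-based splitter below are all assumptions made for illustration only, not the method's actual implementation.

```python
import re

# Hypothetical segment tokens; the abstract does not specify the real ones.
SEGMENT_TOKENS = ("[SPK0]", "[SPK1]")


def split_sentences(utterance):
    """Naively split an utterance after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", utterance.strip())
    return [p for p in parts if p]


def flatten_dialog(turns):
    """Prefix every sentence with the segment token of its speaker.

    `turns` is a list of (speaker_index, utterance) pairs (assumed format).
    """
    pieces = []
    for speaker, utterance in turns:
        token = SEGMENT_TOKENS[speaker % len(SEGMENT_TOKENS)]
        for sentence in split_sentences(utterance):
            pieces.append(f"{token} {sentence}")
    return " ".join(pieces)


dialog = [
    (0, "I went hiking today. The view was amazing!"),
    (1, "Nice, do you have a photo?"),
]
prompt = flatten_dialog(dialog)
print(prompt)
```

The resulting string is the kind of processed dialog context on which a pre-trained text-to-image model could be conditioned during fine-tuning. If the segment tokens are not already in the text encoder's vocabulary, they would presumably need to be registered with its tokenizer first; that step depends on the specific model and is not shown here.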