People often imagine relevant scenes to aid the writing process. In this work, we aim to use visual information during composition in the same way humans do. We propose LIVE, a method that makes pre-trained language models (PLMs) Learn to Imagine for Visually-augmented natural language gEneration. First, we imagine the scene based on the text: we use a diffusion model to synthesize high-quality images conditioned on the input texts. Second, we use CLIP to determine, in a posterior way, whether the text can evoke the imagination. Finally, our imagination is dynamic: we synthesize an image for each sentence rather than generating only one image for an entire paragraph. Technically, we propose a novel plug-and-play fusion layer to obtain visually-augmented representations for each text. Our vision-text fusion layer is compatible with Transformer-based architectures. We have conducted extensive experiments on four generation tasks using BART and T5, and both the automatic results and the human evaluation demonstrate the effectiveness of our proposed method. We will release the code, model, and data at https://github.com/RUCAIBox/LIVE.
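The three steps above (synthesize an image per sentence, check afterwards whether the text and image match, and keep the image only when they do) can be sketched as plain control flow. This is a minimal illustration, not the released implementation: `imagine_per_sentence`, `split_sentences`, the `threshold` value, and the two toy stand-ins for the diffusion model and CLIP are all hypothetical names and choices made here. The abstract does not say how the CLIP decision is made, so the score threshold is an assumption.

```python
import re
from typing import Callable, List, Optional, Tuple

def split_sentences(paragraph: str) -> List[str]:
    # Naive splitter: break after ., ! or ? followed by whitespace.
    parts = re.split(r"(?<=[.!?])\s+", paragraph.strip())
    return [p for p in parts if p]

def imagine_per_sentence(
    paragraph: str,
    synthesize: Callable[[str], str],
    similarity: Callable[[str, str], float],
    threshold: float = 0.25,  # assumed cut-off, not taken from the paper
) -> List[Tuple[str, Optional[str]]]:
    """Pair each sentence with an image, or with None if the match is weak."""
    results = []
    for sentence in split_sentences(paragraph):
        image = synthesize(sentence)          # step 1: text-to-image
        score = similarity(sentence, image)   # step 2: posterior check
        results.append((sentence, image if score >= threshold else None))
    return results

# Toy stand-ins so the sketch runs without any model weights.
VISUAL_WORDS = {"fox", "field", "snowy"}

def toy_synthesize(sentence: str) -> str:
    # A real system would call a diffusion model here.
    return f"<image for: {sentence}>"

def toy_similarity(sentence: str, image: str) -> float:
    # A real system would compare CLIP text and image embeddings here.
    words = set(re.findall(r"[a-z]+", sentence.lower()))
    return 1.0 if words & VISUAL_WORDS else 0.0

text = "A red fox runs across the snowy field. Therefore the argument holds."
pairs = imagine_per_sentence(text, toy_synthesize, toy_similarity)
```

In this toy run the first sentence keeps its image and the second, abstract one is paired with `None`, which mirrors the idea that not every sentence evokes a usable scene. Passing the two models in as callables keeps the loop independent of any particular diffusion or CLIP library.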