People often imagine relevant scenes to aid the writing process. In this work, we aim to use visual information for composition in the same way humans do. We propose LIVE, a method that makes pre-trained language models (PLMs) Learn to Imagine for Visually-augmented natural language gEneration. First, we imagine the scene based on the text: we use a diffusion model to synthesize high-quality images conditioned on the input text. Second, we use CLIP to determine, in a posterior way, whether the text can evoke the imagination. Finally, our imagination is dynamic: we synthesize an image for each sentence rather than generating only one image for an entire paragraph. Technically, we propose a novel plug-and-play fusion layer to obtain visually-augmented representations for each text. Our vision-text fusion layer is compatible with Transformer-based architectures. We have conducted extensive experiments on four generation tasks using BART and T5, and both automatic results and human evaluation demonstrate the effectiveness of our proposed method. We will release the code, model, and data at https://github.com/RUCAIBox/LIVE.
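The sentence-level imagination and posterior check described above can be illustrated with a minimal sketch. It is not the released LIVE implementation: the function names (`imagine`, `split_sentences`, `cosine`), the regex sentence splitter, the cosine-similarity rule, and the `threshold` value are all assumptions made for illustration. The diffusion model and CLIP encoders are passed in as plain callables, so no specific library API is implied.

```python
import math
import re
from typing import Any, Callable, List, Optional, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def split_sentences(text: str) -> List[str]:
    """Naive splitter (assumption): break after '.', '!' or '?' plus whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def imagine(
    text: str,
    synthesize: Callable[[str], Any],      # stand-in for a text-to-image diffusion model
    embed_text: Callable[[str], Vector],   # stand-in for a CLIP-style text encoder
    embed_image: Callable[[Any], Vector],  # stand-in for a CLIP-style image encoder
    threshold: float = 0.25,               # hypothetical cut-off, not taken from the paper
) -> List[Tuple[str, Optional[Any], float]]:
    """Synthesize one image per sentence, then keep it only if the
    text-image similarity, measured after synthesis, reaches the threshold."""
    results = []
    for sentence in split_sentences(text):
        image = synthesize(sentence)
        score = cosine(embed_text(sentence), embed_image(image))
        kept = image if score >= threshold else None
        results.append((sentence, kept, score))
    return results
```

Each returned triple holds the sentence, the retained image (or `None` when the sentence did not evoke a sufficiently matching image), and the similarity score. In a full system, the retained images would presumably be encoded and passed to the fusion layer alongside the text representations.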