The development of multimodal interactive systems is hindered by the lack of rich multimodal (text and image) conversational data, which LLMs require in large quantities. Previous approaches augment textual dialogues with retrieved images, which raises privacy concerns and limits diversity and quality. In this work, we introduce \textbf{M}ultimodal \textbf{A}ugmented \textbf{G}enerative \textbf{I}mages \textbf{D}ialogues (MAGID), a framework that augments text-only dialogues with diverse, high-quality images. MAGID first identifies the dialogue text that should be paired with an image. A diffusion model then generates the corresponding images, ensuring alignment with the identified text. Finally, MAGID incorporates an innovative feedback loop between an image description generation module (a textual LLM) and image quality modules (covering aesthetics, image-text matching, and safety), which work in tandem to produce high-quality multimodal dialogues. We compare MAGID with state-of-the-art (SOTA) baselines on three dialogue datasets using automated and human evaluation. Our results show that MAGID is comparable to or better than the baselines, with significant improvements in human evaluation, especially against retrieval baselines whose image database is small.
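The feedback loop described in the abstract can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration, not MAGID's actual implementation: the function `augment_utterance`, the callables `describe`, `generate`, and `scorers`, the numeric thresholds, the retry limit `max_rounds`, and the choice to drop the image when every round fails are all hypothetical. The abstract only states that a textual LLM and the quality modules (aesthetics, image-text matching, safety) exchange feedback.

```python
from typing import Any, Callable, Dict, Optional, Tuple

# Hypothetical type aliases; the abstract does not specify any interfaces.
Describe = Callable[[str, Optional[str]], str]   # (utterance, feedback) -> image description
Generate = Callable[[str], Any]                  # description -> image (e.g. a diffusion model)
Scorer = Callable[[Any, str], float]             # (image, description) -> quality score


def augment_utterance(
    utterance: str,
    describe: Describe,
    generate: Generate,
    scorers: Dict[str, Scorer],
    thresholds: Dict[str, float],
    max_rounds: int = 3,
) -> Optional[Tuple[str, Any]]:
    """Try to produce an image for one utterance, retrying with feedback.

    Returns (description, image) once every quality check passes, or None
    if no attempt passes within max_rounds (an assumed fallback).
    """
    feedback: Optional[str] = None
    for _ in range(max_rounds):
        # The textual LLM writes (or revises) an image description.
        description = describe(utterance, feedback)
        # The image generator renders the description.
        image = generate(description)
        # Each quality module scores the result independently.
        scores = {name: fn(image, description) for name, fn in scorers.items()}
        failed = sorted(name for name, s in scores.items() if s < thresholds[name])
        if not failed:
            return description, image
        # Tell the description module which checks failed so it can revise.
        feedback = "previous attempt failed: " + ", ".join(failed)
    return None
```

In practice `describe` would wrap a call to a language model, `generate` a diffusion model, and `scorers` would map names such as `"aesthetics"`, `"matching"`, and `"safety"` to the corresponding scoring models. Passing them in as plain callables keeps the loop itself independent of any particular library.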