MAGID: An Automated Pipeline for Generating Synthetic Multi-modal Datasets

The development of multimodal interactive systems is hindered by the lack of rich multimodal (text and image) conversational data, which LLMs need in large quantities. Previous approaches augment textual dialogues with retrieved images, which imposes privacy, diversity, and quality constraints. In this work, we introduce Multimodal Augmented Generative Images Dialogues (MAGID), a framework that augments text-only dialogues with diverse, high-quality images. A diffusion model is then applied to craft the corresponding images, ensuring alignment with the identified text. Finally, MAGID incorporates an innovative feedback loop between an image description generation module (a textual LLM) and image quality modules (addressing aesthetics, image-text matching, and safety), which work in tandem to generate high-quality multimodal dialogues. We compare MAGID to other SOTA baselines on three dialogue datasets, using automated and human evaluation. Our results show that MAGID is comparable to or better than the baselines, with significant improvements in human evaluation, especially against retrieval baselines where the image database is small.
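
The feedback loop described in the abstract can be pictured with a minimal Python sketch. This is not the authors' implementation: the function `augment_utterance`, the callables `describe`, `generate`, and `scorers`, the per-module thresholds, and the retry limit are all hypothetical names and assumptions introduced here for illustration. The sketch also assumes that every quality module returns a score where higher is better.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Optional


@dataclass
class Candidate:
    """An accepted image together with its description and quality scores."""
    description: str
    image: Any
    scores: Dict[str, float]


def augment_utterance(
    utterance: str,
    describe: Callable[[str, Optional[str]], str],   # hypothetical: textual LLM
    generate: Callable[[str], Any],                  # hypothetical: diffusion model
    scorers: Dict[str, Callable[[Any, str], float]], # hypothetical: quality modules
    thresholds: Dict[str, float],                    # assumed: one cutoff per module
    max_rounds: int = 3,                             # assumed retry limit
) -> Optional[Candidate]:
    """Regenerate the description until the image passes every quality check."""
    feedback: Optional[str] = None
    for _ in range(max_rounds):
        # The description module sees the feedback from the previous round, if any.
        description = describe(utterance, feedback)
        image = generate(description)
        scores = {name: score(image, description) for name, score in scorers.items()}
        failed = [name for name, value in scores.items() if value < thresholds[name]]
        if not failed:
            return Candidate(description, image, scores)
        # Report which checks failed so the next description can be revised.
        feedback = "failed checks: " + ", ".join(failed)
    # No candidate passed within the retry limit; the utterance stays text-only.
    return None
```

In this sketch, `scorers` would hold one entry each for aesthetics, image-text matching, and safety, and a caller would fall back to the original text-only utterance whenever the function returns `None`.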

