Multimodal LLMs (MLLMs) have emerged as an extension of Large Language Models (LLMs), enabling the integration of various modalities. However, existing Any-to-Any MLLMs are limited to generating pairwise modalities, 'Text + X', within a single response, such as Text + {Image or Audio or Video}. To address this limitation, we introduce Spider, a novel and efficient Any-to-Many Modalities Generation (AMMG) framework that can generate an arbitrary combination of modalities, 'Text + Xs', such as Text + {Image and Audio and Video}. To achieve efficient AMMG, Spider integrates three core components: a Base Model for basic X-to-X (i.e., Any-to-Any) modality processing, a novel Efficient Decoders-Controller that directs the multimodal decoders to generate many-modal (Xs) content, and an Any-to-Many Instruction Template designed to produce Xs signal prompts. To train Spider, we construct a novel Text-formatted Many-Modal (TMM) dataset, which facilitates learning the X-to-Xs (i.e., Any-to-Many) capability necessary for AMMG. Finally, the well-trained Spider generates a pseudo X-to-Xs dataset, the first-ever X-to-Xs many-modal dataset, supporting future research on the AMMG task. Overall, this work not only pushes the boundaries of multimodal interaction but also provides rich data support for advancing the field.
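To make the X-to-Xs flow concrete, below is a minimal sketch of how an Any-to-Many instruction carrying signal prompts could be assembled and routed to modality decoders. The signal tokens (<IMG>, <AUD>, <VID>), function names, and stub decoders are hypothetical illustrations under assumed conventions, not Spider's actual token format or implementation.

```python
import re

# Hypothetical signal tokens; the paper does not specify its exact token
# format, so these delimiters are illustrative assumptions.
SIGNAL_TOKENS = {
    "image": ("<IMG>", "</IMG>"),
    "audio": ("<AUD>", "</AUD>"),
    "video": ("<VID>", "</VID>"),
}

def build_many_modal_response(text, prompts):
    """Wrap each modality-specific prompt in its signal tokens, yielding a
    single 'Text + Xs' response string in the spirit of the Any-to-Many
    Instruction Template."""
    parts = [text]
    for modality, prompt in prompts.items():
        start, end = SIGNAL_TOKENS[modality]
        parts.append(f"{start}{prompt}{end}")
    return " ".join(parts)

def dispatch_to_decoders(response, decoders):
    """Parse each signal-token span and route its prompt to the matching
    decoder, mimicking what a Decoders-Controller might do."""
    outputs = {}
    for modality, (start, end) in SIGNAL_TOKENS.items():
        m = re.search(re.escape(start) + r"(.*?)" + re.escape(end), response, re.S)
        if m and modality in decoders:
            outputs[modality] = decoders[modality](m.group(1))
    return outputs

# Usage: one response carrying prompts for three decoders at once.
resp = build_many_modal_response(
    "Here is a beach scene:",
    {"image": "a sunny beach", "audio": "ocean waves", "video": "waves rolling in"},
)
stub_decoders = {m: (lambda p, m=m: f"[{m} generated from: {p}]") for m in SIGNAL_TOKENS}
print(dispatch_to_decoders(resp, stub_decoders))
```

The key point this sketch illustrates is that a single response string can carry several signal prompts at once, so one controller pass can fan out to the image, audio, and video decoders in parallel rather than producing one 'Text + X' pair per response.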