Multimodal LLMs (MLLMs) have emerged as an extension of Large Language Models (LLMs), enabling the integration of various modalities. However, existing Any-to-Any MLLMs are limited to generating pairwise modalities, 'Text + X', within a single response, such as Text + {Image or Audio or Video}. To address this limitation, we introduce Spider, a novel and efficient Any-to-Many Modalities Generation (AMMG) framework that can generate an arbitrary combination of modalities, 'Text + Xs', such as Text + {Image and Audio and Video}. To achieve efficient AMMG, Spider integrates three core components: a Base Model for basic X-to-X (i.e., Any-to-Any) modality processing, a novel Efficient Decoders-Controller that directs the multimodal decoders to generate many-modal (Xs) content, and an Any-to-Many Instruction Template designed to produce Xs signal prompts. To train Spider, we construct a novel Text-formatted Many-Modal (TMM) dataset, which facilitates learning the X-to-Xs (i.e., Any-to-Many) capability necessary for AMMG. Finally, the well-trained Spider generates a pseudo X-to-Xs dataset, the first-ever X-to-Xs many-modal dataset, supporting future research on the AMMG task. Overall, this work not only pushes the boundaries of multimodal interaction but also provides rich data support for advancing the field.
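To make the X-to-Xs flow concrete, below is a minimal sketch of how an Any-to-Many instruction carrying signal prompts could be assembled and routed to modality decoders. The signal tokens (<IMG>, <AUD>, <VID>), function names, and stub decoders are hypothetical illustrations under assumed conventions, not Spider's actual token format or implementation.

```python
import re

# Hypothetical signal tokens; the paper does not specify its exact token
# format, so these delimiters are illustrative assumptions.
SIGNAL_TOKENS = {
    "image": ("<IMG>", "</IMG>"),
    "audio": ("<AUD>", "</AUD>"),
    "video": ("<VID>", "</VID>"),
}

def build_many_modal_response(text, prompts):
    """Wrap each modality-specific prompt in its signal tokens, yielding a
    single 'Text + Xs' response string in the spirit of the Any-to-Many
    Instruction Template."""
    parts = [text]
    for modality, prompt in prompts.items():
        start, end = SIGNAL_TOKENS[modality]
        parts.append(f"{start}{prompt}{end}")
    return " ".join(parts)

def dispatch_to_decoders(response, decoders):
    """Parse each signal-token span and route its prompt to the matching
    decoder, mimicking what a Decoders-Controller might do."""
    outputs = {}
    for modality, (start, end) in SIGNAL_TOKENS.items():
        m = re.search(re.escape(start) + r"(.*?)" + re.escape(end), response, re.S)
        if m and modality in decoders:
            outputs[modality] = decoders[modality](m.group(1))
    return outputs

# Usage: one response carrying prompts for three decoders at once.
resp = build_many_modal_response(
    "Here is a beach scene:",
    {"image": "a sunny beach", "audio": "ocean waves", "video": "waves rolling in"},
)
stub_decoders = {m: (lambda p, m=m: f"[{m} generated from: {p}]") for m in SIGNAL_TOKENS}
print(dispatch_to_decoders(resp, stub_decoders))
```

The key point this sketch illustrates is that a single response string can carry several signal prompts at once, so one controller pass can fan out to the image, audio, and video decoders in parallel rather than producing one 'Text + X' pair per response.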