Humans can comprehend diverse modalities and seamlessly transfer information between them. In this work, we introduce ModaVerse, a Multi-modal Large Language Model (MLLM) capable of comprehending and transforming content across modalities including images, videos, and audio. Predominant MLLM frameworks rely largely on aligning the latent spaces of textual and non-textual features. This alignment, which synchronizes a language model trained on textual data with encoders and decoders trained on multi-modal data, often requires extensive training of several projection layers over multiple stages. Inspired by LLM-as-agent methodologies, we propose a novel Input/Output (I/O) alignment mechanism that operates directly at the level of natural language. It aligns the LLM's output with the input of generative models, which avoids the complexities of latent feature alignment and collapses the multiple training stages of existing MLLMs into a single, efficient process. This design substantially reduces both data and computational costs. Experiments on several benchmarks show that our approach attains performance comparable to the state of the art while requiring considerably less data and training time.
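As a rough illustration of what alignment at the natural-language level could look like, the sketch below has the LLM emit a plain-text directive that is parsed and forwarded as a prompt to a text-conditioned generative model. Everything here is an assumption made for illustration: the JSON directive format, the function names (`route_llm_output`, `fake_text_to_image`, and so on), and the stub generators are hypothetical and are not taken from ModaVerse itself.

```python
import json
from typing import Callable, Dict


# Hypothetical stand-ins for text-conditioned generative models.
# A real system would call actual text-to-image/video/audio models here.
def fake_text_to_image(prompt: str) -> str:
    return f"<image generated from: {prompt}>"


def fake_text_to_video(prompt: str) -> str:
    return f"<video generated from: {prompt}>"


def fake_text_to_audio(prompt: str) -> str:
    return f"<audio generated from: {prompt}>"


# Registry mapping a modality name to the generator that accepts a text prompt.
GENERATORS: Dict[str, Callable[[str], str]] = {
    "image": fake_text_to_image,
    "video": fake_text_to_video,
    "audio": fake_text_to_audio,
}


def route_llm_output(llm_output: str) -> str:
    """Parse the LLM's text directive and invoke the matching generator.

    Assumed (hypothetical) directive format:
        {"modality": "image", "prompt": "a red kite over a beach"}
    No latent features are exchanged; the only interface is text.
    """
    directive = json.loads(llm_output)
    modality = directive["modality"]
    prompt = directive["prompt"]

    if modality == "text":
        # Plain text answers need no generative model.
        return prompt
    if modality not in GENERATORS:
        raise ValueError(f"unsupported modality: {modality}")
    return GENERATORS[modality](prompt)


if __name__ == "__main__":
    print(route_llm_output('{"modality": "image", "prompt": "a red kite over a beach"}'))
```

Because the interface between the language model and each generator is ordinary text, swapping a generator in this sketch only requires changing the registry entry; no projection layer between latent spaces is involved.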