Humans can comprehend diverse modalities and seamlessly transfer information between them. In this work, we introduce ModaVerse, a Multi-modal Large Language Model (MLLM) capable of comprehending and transforming content across modalities including images, videos, and audio. Predominant MLLM frameworks rely largely on aligning the latent spaces of textual and non-textual features. This alignment, which synchronizes a language model trained on textual data with encoders and decoders trained on multi-modal data, typically requires training several projection layers over multiple stages. Inspired by LLM-as-agent methodologies, we propose a novel Input/Output (I/O) alignment mechanism that operates directly at the level of natural language: it aligns the LLM's output with the input of generative models, avoiding the complexity of latent-feature alignment and collapsing the multiple training stages of existing MLLMs into a single, efficient process. This conceptual advance yields significant reductions in both data and computational costs. Experiments on several benchmarks demonstrate that our approach attains performance comparable to the state of the art while using considerably less data and training time.
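To make the I/O alignment idea concrete, the sketch below illustrates how an LLM's natural-language output could be routed as a text prompt to downstream generative models, so no latent-feature projection between the LLM and the decoders is needed. This is a minimal illustration under our own assumptions: the `[image]`-style tag format, the `dispatch` helper, and the placeholder generators are hypothetical and do not reflect ModaVerse's actual interface.

```python
import re

# Hypothetical generators (assumption, not the paper's API): each consumes a
# plain natural-language prompt, which is exactly what the LLM emits.
GENERATORS = {
    "image": lambda prompt: f"<generated image for: {prompt}>",
    "video": lambda prompt: f"<generated video for: {prompt}>",
    "audio": lambda prompt: f"<generated audio for: {prompt}>",
}

# Assumed meta-response format: an optional modality tag followed by a prompt,
# e.g. "[image] a corgi surfing a wave". Untagged output is an ordinary answer.
META_TAG = re.compile(r"\[(image|video|audio)\]\s*(.+)", re.DOTALL)

def dispatch(meta_response: str) -> str:
    """Route an LLM 'meta response' to the matching generative model.

    The LLM is instruction-tuned to emit either plain text or a tagged prompt;
    the tag selects the downstream generator and the remainder is passed to it
    verbatim as a text prompt, keeping alignment at the natural-language level.
    """
    match = META_TAG.match(meta_response.strip())
    if match is None:
        return meta_response  # ordinary textual answer, no generation needed
    modality, prompt = match.group(1), match.group(2).strip()
    return GENERATORS[modality](prompt)

if __name__ == "__main__":
    # The strings below stand in for actual LLM output.
    print(dispatch("[image] a watercolor painting of a lighthouse at dusk"))
    print(dispatch("The capital of France is Paris."))
```

Because both sides of the interface are natural language, swapping in a different text-to-image or text-to-audio backbone requires no retraining of projection layers, which is the efficiency the abstract claims.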