Multimodal large language models (MLLMs) have attracted significant attention for their strong multimodal understanding capabilities. However, existing works rely heavily on modality-specific encoders, which usually differ in architecture and are limited to common modalities. In this paper, we present OneLLM, an MLLM that aligns eight modalities to language within a unified framework. We achieve this with a unified multimodal encoder and a progressive multimodal alignment pipeline. Specifically, we first train an image projection module to connect a vision encoder with the LLM. We then build a universal projection module (UPM) by mixing multiple image projection modules with dynamic routing. Finally, we progressively align more modalities to the LLM through the UPM. To fully leverage OneLLM's potential in following instructions, we also curate a comprehensive multimodal instruction dataset of 2M items spanning image, audio, video, point cloud, depth/normal map, IMU, and fMRI brain-activity data. OneLLM is evaluated on 25 diverse benchmarks covering tasks such as multimodal captioning, question answering, and reasoning, where it delivers excellent performance. Code, data, models, and an online demo are available at https://github.com/csuhan/OneLLM.
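The mixing-and-routing idea behind the UPM can be sketched as a soft mixture over parallel projection "experts": a lightweight router produces per-token weights, and the expert outputs are combined by a weighted sum before being fed to the LLM. The class name, dimensions, and the linear-expert form below are illustrative assumptions for exposition, not the paper's exact implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class UniversalProjection:
    """Sketch of a universal projection module (UPM): K parallel
    projection experts mixed by a dynamic router. All weights here are
    random placeholders; in practice they would be trained."""

    def __init__(self, in_dim, out_dim, num_experts=3, seed=0):
        rng = np.random.default_rng(seed)
        # K projection experts (hypothetically linear maps to the LLM space)
        self.experts = [
            rng.standard_normal((in_dim, out_dim)) * 0.02
            for _ in range(num_experts)
        ]
        # Router: maps each token to a score per expert
        self.router = rng.standard_normal((in_dim, num_experts)) * 0.02

    def __call__(self, tokens):
        # tokens: (seq_len, in_dim) features from the unified encoder
        w = softmax(tokens @ self.router)                       # (seq, K)
        outs = np.stack([tokens @ E for E in self.experts], -1)  # (seq, out, K)
        # Weighted sum of expert outputs -> tokens in the LLM embedding space
        return (outs * w[:, None, :]).sum(axis=-1)               # (seq, out)
```

Because the routing weights are a softmax, the UPM degenerates to a single image projection module when one expert dominates, which is consistent with initializing it from trained image projection modules.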