Multimodal large language models (MLLMs) have gained significant attention due to their strong multimodal understanding capability. However, existing works rely heavily on modality-specific encoders, which usually differ in architecture and are limited to common modalities. In this paper, we present OneLLM, an MLLM that aligns eight modalities to language using a unified framework. We achieve this through a unified multimodal encoder and a progressive multimodal alignment pipeline. In detail, we first train an image projection module to connect a vision encoder with the LLM. Then, we build a universal projection module (UPM) by mixing multiple image projection modules and adding dynamic routing. Finally, we progressively align more modalities to the LLM with the UPM. To fully leverage the potential of OneLLM in following instructions, we also curate a comprehensive multimodal instruction dataset of 2M items covering image, audio, video, point cloud, depth/normal map, IMU and fMRI brain activity. OneLLM is evaluated on 25 diverse benchmarks, encompassing tasks such as multimodal captioning, question answering and reasoning, where it delivers excellent performance. Code, data, model and online demo are available at https://github.com/csuhan/OneLLM
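The idea of building the UPM by mixing several projection modules with dynamic routing can be illustrated with a minimal sketch. The abstract gives no architectural details, so everything concrete here is an assumption made for illustration: the class name `ToyUniversalProjection`, the use of plain linear maps as the "projection experts", soft (softmax-weighted) routing, and the tiny dimensions. It is not the released OneLLM implementation.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    """Multiply a (rows x cols) matrix by a vector of length cols."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def random_matrix(rows, cols, rng):
    """Small random weights standing in for trained parameters."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


class ToyUniversalProjection:
    """Hypothetical stand-in for a UPM: K linear experts plus a soft router."""

    def __init__(self, in_dim, out_dim, num_experts, seed=0):
        rng = random.Random(seed)
        # Each expert plays the role of one image projection module (assumed linear).
        self.experts = [random_matrix(out_dim, in_dim, rng) for _ in range(num_experts)]
        # The router scores every expert for a given input token.
        self.router = random_matrix(num_experts, in_dim, rng)

    def route(self, token):
        """Return one routing weight per expert; the weights sum to 1."""
        return softmax(matvec(self.router, token))

    def project(self, token):
        """Weighted sum of the expert outputs, using the routing weights."""
        weights = self.route(token)
        outputs = [matvec(expert, token) for expert in self.experts]
        out_dim = len(outputs[0])
        return [
            sum(w * out[i] for w, out in zip(weights, outputs))
            for i in range(out_dim)
        ]


upm = ToyUniversalProjection(in_dim=4, out_dim=3, num_experts=3)
token = [0.5, -1.0, 0.25, 2.0]  # a made-up modality token
weights = upm.route(token)
projected = upm.project(token)
```

Because the routing weights depend on the input token, tokens from different modalities can draw on the experts in different proportions while sharing the same set of parameters, which is the intuition behind using a single projection module for all modalities.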