In this paper, we introduce MIO, a novel foundation model built on multimodal tokens, capable of understanding and generating speech, text, images, and videos in an end-to-end, autoregressive manner. While large language models (LLMs) and multimodal large language models (MM-LLMs) propel advancements in artificial general intelligence through their versatile capabilities, they still lack true any-to-any understanding and generation. Recently, the release of GPT-4o has showcased the remarkable potential of any-to-any LLMs for complex real-world tasks, enabling omnidirectional input and output across images, speech, and text. However, it is closed-source and does not support the generation of multimodal interleaved sequences. To address this gap, we present MIO, which is trained on a mixture of discrete tokens across four modalities using causal multimodal modeling. MIO undergoes a four-stage training process: (1) alignment pre-training, (2) interleaved pre-training, (3) speech-enhanced pre-training, and (4) comprehensive supervised fine-tuning on diverse textual, visual, and speech tasks. Our experimental results indicate that MIO exhibits competitive, and in some cases superior, performance compared to previous dual-modal baselines, any-to-any model baselines, and even modality-specific baselines. Moreover, MIO demonstrates advanced capabilities inherent to its any-to-any feature, such as interleaved video-text generation, chain-of-visual-thought reasoning, visual guideline generation, instructional image editing, etc.
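The causal multimodal modeling described above can be illustrated with a minimal sketch: tokens from each modality are mapped into disjoint ranges of one shared vocabulary, interleaved into a single sequence, and trained with standard next-token prediction. All vocabulary sizes, offsets, and function names below are illustrative assumptions, not MIO's actual configuration.

```python
# Hypothetical sketch: unifying discrete tokens from several modalities into one
# vocabulary so a single causal (left-to-right) model can be trained on them.
# Vocabulary sizes are assumptions for illustration, not MIO's real settings.

TEXT_VOCAB = 32000      # assumed text tokenizer size
IMAGE_VOCAB = 8192      # assumed image codebook size (e.g. a VQ-style tokenizer)
SPEECH_VOCAB = 1024     # assumed speech codebook size

# Each modality's local token ids are offset into disjoint ranges
# of one shared token space.
OFFSETS = {
    "text": 0,
    "image": TEXT_VOCAB,
    "speech": TEXT_VOCAB + IMAGE_VOCAB,
}

def to_shared(modality: str, token_id: int) -> int:
    """Map a modality-local token id into the shared vocabulary."""
    return OFFSETS[modality] + token_id

def interleave(segments):
    """Flatten (modality, ids) segments into one causal training sequence."""
    seq = []
    for modality, ids in segments:
        seq.extend(to_shared(modality, i) for i in ids)
    return seq

# A toy interleaved sample: caption text, then image tokens, then speech tokens.
sample = interleave([
    ("text", [5, 17, 42]),
    ("image", [7, 7, 300]),
    ("speech", [2, 9]),
])

# Causal modeling reduces to next-token prediction over the shared sequence:
# each (context token, target token) pair is a training signal.
pairs = list(zip(sample[:-1], sample[1:]))
```

With this framing, generation in any modality is uniform: the model emits shared-vocabulary ids autoregressively, and ids are routed back to the appropriate modality decoder by their offset range.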