In this paper, we introduce MIO, a novel foundation model built on multimodal tokens, capable of understanding and generating speech, text, images, and videos in an end-to-end, autoregressive manner. While the emergence of large language models (LLMs) and multimodal large language models (MM-LLMs) has propelled advancements in artificial general intelligence through their versatile capabilities, these models still lack true any-to-any understanding and generation. Recently, the release of GPT-4o has showcased the remarkable potential of any-to-any LLMs for complex real-world tasks, enabling omnidirectional input and output across images, speech, and text. However, it is closed-source and does not support the generation of multimodal interleaved sequences. To address this gap, we present MIO, which is trained on a mixture of discrete tokens across four modalities using causal multimodal modeling. MIO undergoes a four-stage training process: (1) alignment pre-training, (2) interleaved pre-training, (3) speech-enhanced pre-training, and (4) comprehensive supervised fine-tuning on diverse textual, visual, and speech tasks. Our experimental results indicate that MIO exhibits competitive, and in some cases superior, performance compared to previous dual-modal baselines, any-to-any model baselines, and even modality-specific baselines. Moreover, MIO demonstrates advanced capabilities inherent to its any-to-any nature, such as interleaved video-text generation, chain-of-visual-thought reasoning, visual guideline generation, and instructional image editing.
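To make the causal multimodal modeling objective concrete, the sketch below shows one common way such training can be set up; it is an illustration under stated assumptions, not the paper's actual implementation. It assumes discrete image and speech codes (e.g., from VQ-style tokenizers) are offset into disjoint ranges of a single unified vocabulary, so that one decoder-only transformer is trained with ordinary next-token prediction over interleaved multimodal sequences. All names, vocabulary sizes, and the TinyCausalLM module are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Assumed (hypothetical) vocabulary sizes for each modality's tokenizer.
TEXT_VOCAB = 32000    # text subword vocabulary
IMAGE_CODES = 8192    # image tokenizer codebook size
SPEECH_CODES = 4096   # speech tokenizer codebook size
VOCAB = TEXT_VOCAB + IMAGE_CODES + SPEECH_CODES  # unified vocabulary


def to_unified_ids(text_ids, image_codes, speech_codes):
    """Offset each modality's codes into disjoint ranges of the shared vocab,
    then concatenate into one sequence (interleaving order is data-dependent;
    here text -> image -> speech for simplicity)."""
    image_ids = image_codes + TEXT_VOCAB
    speech_ids = speech_codes + TEXT_VOCAB + IMAGE_CODES
    return torch.cat([text_ids, image_ids, speech_ids])


class TinyCausalLM(nn.Module):
    """A minimal decoder-only LM standing in for the real model."""

    def __init__(self, vocab, d=256, n_heads=4, n_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab, d)
        layer = nn.TransformerEncoderLayer(d, n_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d, vocab)

    def forward(self, ids):
        # Causal (left-to-right) attention mask over the mixed-token sequence.
        mask = nn.Transformer.generate_square_subsequent_mask(ids.size(1))
        h = self.blocks(self.embed(ids), mask=mask)
        return self.head(h)


def causal_lm_loss(model, token_ids):
    """Standard next-token cross-entropy: every token, regardless of modality,
    is predicted from the tokens to its left."""
    logits = model(token_ids[:, :-1])          # (B, T-1, VOCAB)
    targets = token_ids[:, 1:]                 # targets shifted by one
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           targets.reshape(-1))


# Usage: build one interleaved sequence from random per-modality codes.
ids = to_unified_ids(torch.randint(0, TEXT_VOCAB, (10,)),
                     torch.randint(0, IMAGE_CODES, (5,)),
                     torch.randint(0, SPEECH_CODES, (5,))).unsqueeze(0)
loss = causal_lm_loss(TinyCausalLM(VOCAB), ids)
```

Because all modalities share one token space and one loss, the same forward pass handles understanding (conditioning on multimodal context) and generation (sampling tokens of any modality), which is the property the abstract refers to as any-to-any.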