In this paper, we introduce MIO, a novel foundation model built on multimodal tokens, capable of understanding and generating speech, text, images, and videos in an end-to-end, autoregressive manner. While large language models (LLMs) and multimodal large language models (MM-LLMs) propel advancements in artificial general intelligence through their versatile capabilities, they still lack true any-to-any understanding and generation. Recently, the release of GPT-4o has showcased the remarkable potential of any-to-any LLMs for complex real-world tasks, enabling omnidirectional input and output across images, speech, and text. However, it is closed-source and does not support the generation of multimodal interleaved sequences. To address this gap, we present MIO, which is trained on a mixture of discrete tokens across four modalities using causal multimodal modeling. MIO undergoes a four-stage training process: (1) alignment pre-training, (2) interleaved pre-training, (3) speech-enhanced pre-training, and (4) comprehensive supervised fine-tuning on diverse textual, visual, and speech tasks. Our experimental results indicate that MIO exhibits competitive, and in some cases superior, performance compared to previous dual-modal baselines, any-to-any model baselines, and even modality-specific baselines. Moreover, MIO demonstrates advanced capabilities inherent to its any-to-any feature, such as interleaved video-text generation, chain-of-visual-thought reasoning, visual guideline generation, instructional image editing, etc.
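The causal multimodal modeling described above can be illustrated with a minimal sketch: tokens from each modality are mapped into disjoint ranges of one shared vocabulary, interleaved into a single sequence, and trained with standard next-token prediction. All vocabulary sizes, offsets, and function names below are illustrative assumptions, not MIO's actual configuration.

```python
# Hypothetical sketch: unifying discrete tokens from several modalities into one
# vocabulary so a single causal (left-to-right) model can be trained on them.
# Vocabulary sizes are assumptions for illustration, not MIO's real settings.

TEXT_VOCAB = 32000      # assumed text tokenizer size
IMAGE_VOCAB = 8192      # assumed image codebook size (e.g. a VQ-style tokenizer)
SPEECH_VOCAB = 1024     # assumed speech codebook size

# Each modality's local token ids are offset into disjoint ranges
# of one shared token space.
OFFSETS = {
    "text": 0,
    "image": TEXT_VOCAB,
    "speech": TEXT_VOCAB + IMAGE_VOCAB,
}

def to_shared(modality: str, token_id: int) -> int:
    """Map a modality-local token id into the shared vocabulary."""
    return OFFSETS[modality] + token_id

def interleave(segments):
    """Flatten (modality, ids) segments into one causal training sequence."""
    seq = []
    for modality, ids in segments:
        seq.extend(to_shared(modality, i) for i in ids)
    return seq

# A toy interleaved sample: caption text, then image tokens, then speech tokens.
sample = interleave([
    ("text", [5, 17, 42]),
    ("image", [7, 7, 300]),
    ("speech", [2, 9]),
])

# Causal modeling reduces to next-token prediction over the shared sequence:
# each (context token, target token) pair is a training signal.
pairs = list(zip(sample[:-1], sample[1:]))
```

With this framing, generation in any modality is uniform: the model emits shared-vocabulary ids autoregressively, and ids are routed back to the appropriate modality decoder by their offset range.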