We present Unified-IO 2, the first autoregressive multimodal model capable of understanding and generating images, text, audio, and actions. To unify different modalities, we tokenize inputs and outputs (images, text, audio, actions, bounding boxes, etc.) into a shared semantic space and then process them with a single encoder-decoder transformer model. Because training with such diverse modalities is challenging, we propose various architectural improvements to stabilize model training. We train our model from scratch on a large multimodal pre-training corpus drawn from diverse sources, using a multimodal mixture-of-denoisers objective. To learn an expansive set of skills, such as following multimodal instructions, we construct and fine-tune on an ensemble of 120 datasets with prompts and augmentations. With a single unified model, Unified-IO 2 achieves state-of-the-art performance on the GRIT benchmark and strong results on more than 35 benchmarks, spanning image generation and understanding, natural language understanding, video and audio understanding, and robotic manipulation. We release all our models to the research community.
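As an illustration of what tokenizing a non-text output into a shared vocabulary can look like, the following minimal sketch quantizes bounding-box coordinates into discrete bins and maps each bin to a token id placed after the text vocabulary. It is not the authors' implementation: the vocabulary size, the number of bins, and the function names `box_to_tokens` and `tokens_to_box` are all assumptions chosen for the example.

```python
from typing import List, Sequence, Tuple

# Hypothetical sizes, chosen only for illustration.
TEXT_VOCAB_SIZE = 32000  # ids below this value are assumed to be text tokens
NUM_BINS = 1000          # number of discrete coordinate bins per axis


def box_to_tokens(box: Sequence[float], image_size: Tuple[int, int]) -> List[int]:
    """Map a pixel box (x0, y0, x1, y1) to four token ids after the text vocabulary."""
    width, height = image_size
    x0, y0, x1, y1 = box
    normalized = (x0 / width, y0 / height, x1 / width, y1 / height)
    tokens = []
    for value in normalized:
        clipped = min(max(value, 0.0), 1.0)  # keep coordinates inside the image
        bin_index = min(int(clipped * NUM_BINS), NUM_BINS - 1)
        tokens.append(TEXT_VOCAB_SIZE + bin_index)
    return tokens


def tokens_to_box(tokens: Sequence[int], image_size: Tuple[int, int]) -> Tuple[float, ...]:
    """Invert box_to_tokens, returning the centre of each coordinate bin in pixels."""
    width, height = image_size
    scales = (width, height, width, height)
    return tuple(
        (token - TEXT_VOCAB_SIZE + 0.5) / NUM_BINS * scale
        for token, scale in zip(tokens, scales)
    )


tokens = box_to_tokens((0, 0, 100, 50), (200, 100))
recovered = tokens_to_box(tokens, (200, 100))
```

Because the box becomes a short sequence of ordinary token ids, a sequence model can read or emit it alongside text tokens; the round trip is lossy only up to the width of one bin.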