We introduce AnyGPT, an any-to-any multimodal language model that uses discrete representations for the unified processing of various modalities, including speech, text, images, and music. AnyGPT can be trained stably without any alterations to the current large language model (LLM) architecture or training paradigms. Instead, it relies exclusively on data-level preprocessing, so that new modalities can be integrated into LLMs seamlessly, much as new languages are incorporated. We build a text-centric multimodal dataset for multimodal alignment pre-training. Using generative models, we synthesize the first large-scale any-to-any multimodal instruction dataset. It consists of 108k samples of multi-turn conversations that intricately interweave various modalities, equipping the model to handle arbitrary combinations of multimodal inputs and outputs. Experimental results demonstrate that AnyGPT is capable of any-to-any multimodal conversation while achieving performance comparable to specialized models across all modalities, showing that discrete representations can effectively and conveniently unify multiple modalities within a language model. Demos are available at https://junzhan2000.github.io/AnyGPT.github.io/
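To make the idea of data-level preprocessing concrete, the following is a minimal sketch of how discrete codes from several modalities could share one token ID space with text, so that an unmodified language model sees a single flat token sequence. It is an illustration under stated assumptions, not AnyGPT's actual implementation: the text vocabulary size, the codebook sizes, the boundary-token layout, and the names `build_offsets` and `encode_segment` are all hypothetical.

```python
from typing import Dict, List, Tuple

# Assumed values for illustration only; not taken from the paper.
TEXT_VOCAB_SIZE = 32000
CODEBOOK_SIZES = {"image": 8192, "speech": 1024, "music": 4096}


def build_offsets(
    text_vocab_size: int, codebook_sizes: Dict[str, int]
) -> Tuple[Dict[str, Dict[str, int]], int]:
    """Reserve a contiguous ID range per modality after the text vocabulary.

    Each modality gets a start-boundary ID, an end-boundary ID, and then
    one ID per codebook entry. Returns the layout and the new vocab size.
    """
    offsets: Dict[str, Dict[str, int]] = {}
    next_id = text_vocab_size
    for modality in sorted(codebook_sizes):  # sorted for a stable layout
        offsets[modality] = {
            "start": next_id,
            "end": next_id + 1,
            "base": next_id + 2,
            "size": codebook_sizes[modality],
        }
        next_id += 2 + codebook_sizes[modality]
    return offsets, next_id


def encode_segment(
    modality: str, codes: List[int], offsets: Dict[str, Dict[str, int]]
) -> List[int]:
    """Map raw codebook indices to shared-vocabulary IDs, wrapped in boundaries."""
    slot = offsets[modality]
    for code in codes:
        if not 0 <= code < slot["size"]:
            raise ValueError(f"code {code} out of range for {modality}")
    return [slot["start"]] + [slot["base"] + c for c in codes] + [slot["end"]]


offsets, vocab_size = build_offsets(TEXT_VOCAB_SIZE, CODEBOOK_SIZES)

# Interleave (hypothetical) text token IDs with an image segment.
text_ids = [17, 42, 99]
sequence = text_ids + encode_segment("image", [0, 5], offsets)
print(vocab_size)
print(sequence)
```

Under this layout, supporting a new modality only requires extending the vocabulary and embedding table and tokenizing the data accordingly, which mirrors the abstract's analogy to adding a new language.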