We introduce AnyGPT, an any-to-any multimodal language model that uses discrete representations to process speech, text, images, and music in a unified way. AnyGPT can be trained stably without any alterations to the current large language model (LLM) architecture or training paradigms. Instead, it relies exclusively on data-level preprocessing, which allows new modalities to be integrated into LLMs as seamlessly as new languages. We build a text-centric multimodal dataset for multimodal alignment pre-training. Using generative models, we synthesize the first large-scale any-to-any multimodal instruction dataset: 108k multi-turn conversations that intricately interweave multiple modalities, equipping the model to handle arbitrary combinations of multimodal inputs and outputs. Experimental results demonstrate that AnyGPT can carry out any-to-any multimodal conversation while achieving performance comparable to specialized models across all modalities, showing that discrete representations can effectively and conveniently unify multiple modalities within a language model. Demos are available at https://junzhan2000.github.io/AnyGPT.github.io/
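To make the data-level preprocessing idea concrete, the following is a minimal Python sketch of how non-text modalities can be discretized and mapped into an expanded LLM vocabulary. All vocabulary sizes, offsets, helper names, and boundary tokens here are illustrative assumptions for exposition, not the paper's actual tokenizers or special tokens; in practice a modality-specific detokenizer would map generated ids back to pixels or waveforms.

```python
# Sketch of data-level preprocessing: non-text modalities are tokenized into
# discrete codes and mapped into disjoint id ranges of an expanded LLM
# vocabulary, so an unmodified decoder-only LLM can be trained on the
# resulting flat sequences with the standard next-token objective.
# All sizes and special tokens below are illustrative assumptions.

TEXT_VOCAB_SIZE = 32_000       # base LLM vocabulary (assumed)
IMAGE_CODEBOOK_SIZE = 8_192    # discrete image tokenizer codebook (assumed)
SPEECH_CODEBOOK_SIZE = 1_024   # discrete speech tokenizer codebook (assumed)

# Each modality's codes occupy a disjoint id range in the expanded vocabulary.
IMAGE_OFFSET = TEXT_VOCAB_SIZE
SPEECH_OFFSET = IMAGE_OFFSET + IMAGE_CODEBOOK_SIZE

# Hypothetical modality-boundary tokens, placed after all codebook ranges.
SOI = SPEECH_OFFSET + SPEECH_CODEBOOK_SIZE   # start-of-image
EOI = SOI + 1                                # end-of-image

def wrap_image(image_codes: list[int]) -> list[int]:
    """Shift image codebook indices into the LLM id space and bracket them
    with boundary tokens, yielding an ordinary token sequence."""
    return [SOI] + [IMAGE_OFFSET + c for c in image_codes] + [EOI]

def build_example(text_ids: list[int], image_codes: list[int]) -> list[int]:
    """Interleave text tokens with discretized image tokens; the LLM sees one
    flat sequence, treating the image codes like tokens of a new language."""
    return text_ids + wrap_image(image_codes)

# Example: three (pretend) text token ids followed by four image codes.
print(build_example([17, 244, 9], [5, 731, 12, 88]))
# -> [17, 244, 9, 41216, 32005, 32731, 32012, 32088, 41217]
```

Because the unification happens entirely in this id-mapping step, adding a further modality under this scheme only requires a new codebook range and boundary tokens, leaving the model architecture and training loop untouched.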