X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages

Large language models (LLMs) have demonstrated remarkable language abilities. GPT-4, based on advanced LLMs, exhibits extraordinary multimodal capabilities beyond previous visual language models. We attribute this to the use of more advanced LLMs compared with previous multimodal models. Unfortunately, the model architecture and training strategies of GPT-4 are unknown. To endow LLMs with multimodal capabilities, we propose X-LLM, which converts multi-modalities (images, speech, videos) into foreign languages using X2L interfaces and feeds them into a large language model (ChatGLM). Specifically, X-LLM aligns multiple frozen single-modal encoders and a frozen LLM using X2L interfaces, where ``X'' denotes multi-modalities such as image, speech, and videos, and ``L'' denotes languages. X-LLM's training consists of three stages: (1) Converting multimodal information: each X2L interface is trained separately to align with its respective single-modal encoder, so that multimodal information is converted into languages. (2) Aligning X2L representations with the LLM: single-modal encoders are aligned with the LLM through X2L interfaces independently. (3) Integrating multiple modalities: all single-modal encoders are aligned with the LLM through X2L interfaces to integrate multimodal capabilities into the LLM. Our experiments show that X-LLM demonstrates impressive multimodal chat abilities, sometimes exhibiting the behaviors of multimodal GPT-4 on unseen images/instructions, and achieves an 84.5\% relative score compared with GPT-4 on a synthetic multimodal instruction-following dataset. We also conduct quantitative tests on using an LLM for ASR and multimodal ASR, with the aim of advancing LLM-based speech recognition.
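
The three-stage schedule described above can be illustrated with a minimal sketch. It is not the authors' implementation: the names `Module`, `XLLMSketch`, `training_runs`, and `configure` are hypothetical, and the `uses_llm` flag for stage 1 is an assumption inferred from the stage descriptions. The sketch records only which components are frozen and which X2L interfaces are trained in each stage.

```python
from dataclasses import dataclass

# Modalities named in the abstract.
MODALITIES = ("image", "speech", "video")


@dataclass
class Module:
    """Hypothetical stand-in for a network component."""
    name: str
    trainable: bool = False


@dataclass
class XLLMSketch:
    encoders: dict    # modality -> frozen single-modal encoder
    interfaces: dict  # modality -> X2L interface
    llm: Module       # frozen LLM (ChatGLM in the abstract)


def build_sketch() -> XLLMSketch:
    encoders = {m: Module(f"{m}_encoder") for m in MODALITIES}
    interfaces = {m: Module(f"{m}_x2l") for m in MODALITIES}
    return XLLMSketch(encoders, interfaces, Module("ChatGLM"))


def training_runs(stage: int) -> list:
    """List the training runs in a stage.

    Stages 1 and 2 train each interface in its own run; stage 3 trains
    all interfaces together. Assumption: the LLM is not in the loop
    during stage 1.
    """
    if stage == 1:
        return [{"interfaces": (m,), "uses_llm": False} for m in MODALITIES]
    if stage == 2:
        return [{"interfaces": (m,), "uses_llm": True} for m in MODALITIES]
    if stage == 3:
        return [{"interfaces": MODALITIES, "uses_llm": True}]
    raise ValueError(f"unknown stage: {stage}")


def configure(model: XLLMSketch, run: dict) -> list:
    """Set trainable flags for one run and return the trainable names."""
    for enc in model.encoders.values():
        enc.trainable = False  # encoders stay frozen in every stage
    model.llm.trainable = False  # the LLM stays frozen in every stage
    for modality, interface in model.interfaces.items():
        interface.trainable = modality in run["interfaces"]
    return [i.name for i in model.interfaces.values() if i.trainable]


if __name__ == "__main__":
    model = build_sketch()
    for stage in (1, 2, 3):
        for run in training_runs(stage):
            print(stage, run["uses_llm"], configure(model, run))
```

In this sketch the only parameters that ever become trainable are those of the X2L interfaces, which mirrors the abstract's statement that the single-modal encoders and the LLM are frozen.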
