Building general-purpose models that can perceive diverse real-world modalities and solve a variety of tasks is an appealing goal in artificial intelligence. In this paper, we present ChatBridge, a novel multimodal language model that uses the expressive capabilities of language as a catalyst to bridge the gap between modalities. We show that language-paired two-modality data alone is sufficient to connect all modalities. ChatBridge builds on recent large language models (LLMs) and extends their zero-shot capabilities to diverse multimodal inputs. ChatBridge is trained in two stages. The first stage aligns each modality with language, which gives rise to emergent multimodal correlation and collaboration abilities. The second stage instruction-finetunes ChatBridge to align it with user intent, using our newly proposed multimodal instruction tuning dataset, MULTIS, which covers 16 multimodal tasks across the text, image, video, and audio modalities. We show strong quantitative and qualitative results on zero-shot multimodal tasks covering text, image, video, and audio modalities. All code, data, and models of ChatBridge will be open-sourced.