Recently, Multimodal Large Language Models (MLLMs), represented by GPT-4V, have emerged as a new research hotspot, using powerful Large Language Models (LLMs) as a brain to perform multimodal tasks. The surprising emergent capabilities of MLLMs, such as writing stories based on images and OCR-free math reasoning, are rare in traditional multimodal methods, suggesting a potential path to artificial general intelligence. To this end, both academia and industry have endeavored to develop MLLMs that can compete with or even surpass GPT-4V, pushing the limits of research at a surprising speed. In this paper, we aim to trace and summarize the recent progress of MLLMs. First, we present the basic formulation of the MLLM and delineate its related concepts, including architecture, training strategy and data, and evaluation. Then, we introduce research topics on how MLLMs can be extended to support finer granularity, more modalities, more languages, and broader scenarios. We continue with multimodal hallucination and extended techniques, including Multimodal In-Context Learning (M-ICL), Multimodal Chain-of-Thought (M-CoT), and LLM-Aided Visual Reasoning (LAVR). To conclude the paper, we discuss existing challenges and point out promising research directions. Given that the era of MLLMs has only just begun, we will keep updating this survey and hope it can inspire more research. An associated GitHub repository collecting the latest papers is available at https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models.