Recently, Multimodal Large Language Models (MLLMs), represented by GPT-4V, have emerged as a new research hotspot. These models use powerful Large Language Models (LLMs) as a brain to perform multimodal tasks. The surprising emergent capabilities of MLLMs, such as writing stories based on images and Optical Character Recognition (OCR)-free math reasoning, are rare in traditional multimodal methods, suggesting a potential path to artificial general intelligence. To this end, both academia and industry have endeavored to develop MLLMs that can compete with or even outperform GPT-4V, pushing the limits of research at a surprising speed. In this paper, we aim to trace and summarize the recent progress of MLLMs. First, we present the basic formulation of the MLLM and delineate its related concepts, including architecture, training strategy and data, and evaluation. Then, we introduce research topics on how MLLMs can be extended to support finer granularity and more modalities, languages, and scenarios. We continue with multimodal hallucination and extended techniques, including Multimodal In-Context Learning (M-ICL), Multimodal Chain-of-Thought (M-CoT), and LLM-Aided Visual Reasoning (LAVR). To conclude the paper, we discuss existing challenges and point out promising research directions.