Connecting text and visual modalities plays an essential role in generative intelligence. Inspired by the success of large language models, significant research effort is therefore being devoted to the development of Multimodal Large Language Models (MLLMs). These models can integrate visual and textual modalities, both as input and as output, while providing a dialogue-based interface and instruction-following capabilities. In this paper, we provide a comprehensive review of recent visual-based MLLMs, analyzing their architectural choices, multimodal alignment strategies, and training techniques. We also analyze these models in detail across a wide range of tasks, including visual grounding, image generation and editing, visual understanding, and domain-specific applications. Additionally, we compile and describe training datasets and evaluation benchmarks, and compare existing models in terms of performance and computational requirements. Overall, this survey offers an overview of the current state of the art, laying the groundwork for future MLLMs.