Connecting text and visual modalities plays an essential role in generative intelligence. For this reason, inspired by the success of large language models, significant research efforts are being devoted to the development of Multimodal Large Language Models (MLLMs). These models can seamlessly integrate visual and textual modalities, while providing a dialogue-based interface and instruction-following capabilities. In this paper, we provide a comprehensive review of recent vision-based MLLMs, analyzing their architectural choices, multimodal alignment strategies, and training techniques. We also conduct a detailed analysis of these models across a wide range of tasks, including visual grounding, image generation and editing, visual understanding, and domain-specific applications. Additionally, we compile and describe training datasets and evaluation benchmarks, conducting comparisons among existing models in terms of performance and computational requirements. Overall, this survey offers a comprehensive overview of the current state of the art, laying the groundwork for future MLLMs.