This review paper explores Multimodal Large Language Models (MLLMs), which build on Large Language Models (LLMs) such as GPT-4 to handle multimodal data such as text and vision. MLLMs can generate narratives from images and answer image-based questions, narrowing the gap to real-world human-computer interaction and hinting at a potential pathway to artificial general intelligence. However, MLLMs still struggle to bridge the semantic gap between modalities, which can lead to erroneous generation and pose risks to society. Choosing an appropriate modality alignment method is crucial, because a poorly chosen method may require more parameters while yielding only limited performance improvement. This paper surveys modality alignment methods for LLMs and the capabilities they currently provide. Implementing modality alignment allows LLMs to address environmental issues and enhance accessibility. We group existing modality alignment methods in MLLMs into four categories: (1) Multimodal Converters, which transform data into a form that LLMs can understand; (2) Multimodal Perceivers, which improve how LLMs perceive different types of data; (3) Tools Assistance, which converts data into one common format, usually text; and (4) Data-Driven methods, which teach LLMs to understand specific types of data through a dataset. The field is still in a phase of exploration and experimentation, and we will continue to organize and update the existing research methods for multimodal information alignment.
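To make the first category concrete, the following is a minimal sketch of what a Multimodal Converter might look like, assuming the simplest possible design: a single linear projection that maps image patch features into the LLM's token-embedding dimension so they can be placed in front of the text embeddings. The abstract does not prescribe this design; the function names (`make_projection`, `project`, `build_llm_input`), the dimensions, and the random weights are all hypothetical and chosen only for illustration.

```python
import random


def make_projection(in_dim, out_dim, seed=0):
    """Create a hypothetical (untrained) out_dim x in_dim projection matrix."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
            for _ in range(out_dim)]


def project(features, weights):
    """Map each feature vector of length in_dim to a vector of length out_dim."""
    return [[sum(w * x for w, x in zip(row, vec)) for row in weights]
            for vec in features]


def build_llm_input(image_features, text_embeddings, weights):
    """Prepend projected visual tokens to the text token embeddings."""
    visual_tokens = project(image_features, weights)
    return visual_tokens + text_embeddings


# Assumed toy sizes: 3 image patches with 4 features each,
# 2 text tokens already embedded in a 6-dimensional LLM space.
image_features = [[0.1 * (i + j) for j in range(4)] for i in range(3)]
text_embeddings = [[0.0] * 6, [1.0] * 6]
weights = make_projection(in_dim=4, out_dim=6)

sequence = build_llm_input(image_features, text_embeddings, weights)
print(len(sequence), len(sequence[0]))
```

In a real system the projection would be learned, and the image features would come from a pretrained vision encoder; the sketch only shows how converted visual tokens and text tokens can end up in one sequence of equal-width vectors.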