This review paper explores Multimodal Large Language Models (MLLMs), which build on Large Language Models (LLMs) such as GPT-4 to handle multimodal data such as text and vision. MLLMs demonstrate capabilities such as generating image narratives and answering image-based questions, narrowing the gap toward real-world human-computer interaction and hinting at a potential pathway to artificial general intelligence. However, MLLMs still struggle to bridge the semantic gap between modalities, which may lead to erroneous generation and pose potential risks to society. Choosing an appropriate modality alignment method is crucial, as an improper method might require more parameters while yielding limited performance improvement. This paper aims to explore modality alignment methods for LLMs and their existing capabilities. Implementing modality alignment allows LLMs to address environmental issues and enhance accessibility. The study groups existing modality alignment methods in MLLMs into four categories: (1) Multimodal Converters, which transform data into a form LLMs can understand; (2) Multimodal Perceivers, which improve how LLMs perceive different types of data; (3) Tools Assistance, which converts data into one common format, usually text; and (4) Data-Driven methods, which teach LLMs to understand specific types of data in a dataset. This field is still in a phase of exploration and experimentation, and we will organize and update various existing research methods for multimodal information alignment.
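As a rough illustration of the first category, a Multimodal Converter can be pictured as a mapping from vision features into the embedding space the LLM already consumes. The sketch below is only a toy under stated assumptions: the function names (`make_projection`, `project`, `build_llm_input`), the dimensions, and the use of a single random linear projection are all hypothetical and are not taken from any method surveyed in the paper. Real converters are learned and typically far more elaborate.

```python
import random


def make_projection(d_in, d_out, seed=0):
    """Hypothetical stand-in for a learned linear projection.

    Weights are random here; in a real system they would be trained.
    """
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(d_out)] for _ in range(d_in)]


def project(features, weights):
    """Map each feature vector (length d_in) to the LLM embedding width (d_out)."""
    d_out = len(weights[0])
    return [
        [sum(x * weights[i][j] for i, x in enumerate(vec)) for j in range(d_out)]
        for vec in features
    ]


def build_llm_input(image_features, text_embeddings, weights):
    """Prepend projected image tokens to the text token embeddings."""
    return project(image_features, weights) + text_embeddings


# Toy sizes, chosen only for illustration (assumptions, not from the paper).
D_VISION, D_LLM = 8, 4
weights = make_projection(D_VISION, D_LLM)
image_features = [[1.0] * D_VISION for _ in range(3)]  # 3 image patch features
text_embeddings = [[0.0] * D_LLM for _ in range(5)]    # 5 text token embeddings
sequence = build_llm_input(image_features, text_embeddings, weights)
```

After projection, the image patches and the text tokens share one embedding width, so the LLM can consume them as a single sequence. The other three categories differ mainly in where this bridging happens: inside the perception module, through external tools that emit text, or through the training data itself.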