In the past year, MultiModal Large Language Models (MM-LLMs) have undergone substantial advancements, augmenting off-the-shelf LLMs to support MM inputs or outputs via cost-effective training strategies. The resulting models not only preserve the inherent reasoning and decision-making capabilities of LLMs but also support a diverse range of MM tasks. In this paper, we provide a comprehensive survey aimed at facilitating further research on MM-LLMs. First, we outline general design formulations for the model architecture and training pipeline. Next, we introduce a taxonomy encompassing $122$ MM-LLMs, each characterized by its specific formulations. We then review the performance of selected MM-LLMs on mainstream benchmarks and summarize key training recipes for improving their effectiveness. Finally, we explore promising directions for MM-LLMs, and we maintain a real-time tracking website for the latest developments in the field. We hope that this survey contributes to the ongoing advancement of the MM-LLMs domain.