In the past year, MultiModal Large Language Models (MM-LLMs) have advanced substantially, augmenting off-the-shelf LLMs to support multimodal (MM) inputs or outputs via cost-effective training strategies. The resulting models not only preserve the inherent reasoning and decision-making capabilities of LLMs but also enable a diverse range of MM tasks. In this paper, we provide a comprehensive survey aimed at facilitating further research on MM-LLMs. Specifically, we first outline general design formulations for model architecture and training pipeline. We then briefly introduce $26$ existing MM-LLMs, each characterized by its specific formulations. Additionally, we review the performance of MM-LLMs on mainstream benchmarks and summarize key training recipes for improving their effectiveness. Finally, we explore promising directions for MM-LLMs and maintain a website that tracks the latest developments in the field in real time. We hope that this survey contributes to the ongoing advancement of the MM-LLMs domain.