Multimodal Large Language Models (MLLMs) have become increasingly important due to their state-of-the-art performance and their ability to integrate multiple data modalities, such as text, images, and audio, to perform complex tasks with high accuracy. This paper presents a comprehensive survey on personalized multimodal large language models, focusing on their architectures, training methods, and applications. We propose an intuitive taxonomy for categorizing the techniques used to personalize MLLMs for individual users, and review the techniques within each category. Furthermore, we discuss how such techniques can be combined or adapted when appropriate, highlighting their advantages and underlying rationale. We also provide a succinct summary of the personalization tasks investigated in existing research, along with the evaluation metrics commonly used. Additionally, we summarize the datasets that are useful for benchmarking personalized MLLMs. Finally, we outline critical open challenges. This survey aims to serve as a valuable resource for researchers and practitioners seeking to understand and advance the development of personalized multimodal large language models.