The rapid development of Artificial Intelligence (AI) has revolutionized numerous fields, with large language models (LLMs) and computer vision (CV) systems driving advancements in natural language understanding and visual processing, respectively. The convergence of these technologies has catalyzed the rise of multimodal AI, enabling richer, cross-modal understanding that spans text, vision, audio, and video. Multimodal large language models (MLLMs), in particular, have emerged as a powerful framework, demonstrating impressive capabilities in tasks such as image-text generation, visual question answering, and cross-modal retrieval. Despite these advancements, the complexity and scale of MLLMs introduce significant challenges for interpretability and explainability, which are essential for establishing transparency, trustworthiness, and reliability in high-stakes applications. This paper provides a comprehensive survey of the interpretability and explainability of MLLMs, proposing a novel framework that categorizes existing research along three perspectives: (I) Data, (II) Model, and (III) Training \& Inference. We systematically analyze interpretability from token-level to embedding-level representations, assess approaches to both architecture analysis and architecture design, and explore training and inference strategies that enhance transparency. By comparing these methodologies, we identify their strengths and limitations and propose future research directions to address unresolved challenges in multimodal explainability. This survey offers a foundational resource for advancing interpretability and transparency in MLLMs, guiding researchers and practitioners toward developing more accountable and robust multimodal AI systems.