Decoding the Multimodal Maze: A Systematic Review on the Adoption of Explainability in Multimodal Attention-based Models

Multimodal learning has witnessed remarkable advancements in recent years, particularly with the integration of attention-based models, leading to significant performance gains across a variety of tasks. Parallel to this progress, the demand for explainable artificial intelligence (XAI) has spurred a growing body of research aimed at interpreting the complex decision-making processes of these models. This systematic literature review analyzes research published between January 2020 and early 2024 that focuses on the explainability of multimodal models. Framed within the broader goals of XAI, we examine the literature across multiple dimensions, including model architecture, modalities involved, explanation algorithms, and evaluation methodologies. Our analysis reveals that most studies are concentrated on vision-language and language-only models, with attention-based techniques being the most commonly employed for explanation. However, these methods often fall short in capturing the full spectrum of interactions between modalities, a challenge further compounded by the architectural heterogeneity across domains. Importantly, we find that evaluation methods for XAI in multimodal settings are largely non-systematic, lacking consistency, robustness, and consideration of modality-specific cognitive and contextual factors. To address these gaps, we not only synthesize findings from the surveyed works but also incorporate a complementary analysis that integrates recent and emerging advances driving multimodal explainability. Based on these insights, we provide a comprehensive set of recommendations aimed at promoting rigorous, transparent, and standardized evaluation and reporting practices in multimodal XAI research. Our goal is to support future research toward more interpretable, accountable, and responsible multimodal AI systems, with explainability at their core.
