Transformer models achieve state-of-the-art performance across domains and tasks, yet their deeply layered representations make their predictions difficult to interpret. Existing explainability methods rely on final-layer attributions, capture either local token-level attributions or global attention patterns but not both, and remain insensitive to inter-token dependencies and structural components. They also fail to capture how relevance evolves across layers and how those components shape decision-making. To address these limitations, we propose the \textbf{Context-Aware Layer-wise Integrated Gradients (CA-LIG)} framework, a unified hierarchical attribution method that computes layer-wise Integrated Gradients within each Transformer block and fuses these token-level attributions with class-specific attention gradients. This integration yields signed, context-sensitive attribution maps that capture supportive and opposing evidence while tracing the hierarchical flow of relevance through the Transformer layers. We evaluate CA-LIG across diverse tasks, domains, and Transformer model families: sentiment analysis and long-document, multi-class classification with BERT; hate speech detection in a low-resource language setting with XLM-R and AfroLM; and image classification with a Masked Autoencoder Vision Transformer. Across all tasks and architectures, CA-LIG provides more faithful attributions, shows stronger sensitivity to contextual dependencies, and produces clearer, more semantically coherent visualizations than established explainability methods. These results indicate that CA-LIG offers a more comprehensive, context-aware, and reliable account of Transformer decision-making, advancing both the practical interpretability and the conceptual understanding of deep neural models.
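To make the attribution pipeline concrete, the sketch below illustrates one way the CA-LIG computation could be realized in PyTorch for a BERT-style classifier. It is a minimal approximation under stated assumptions, not the paper's implementation: the zero-embedding baseline, the number of interpolation steps, the head-averaged attention relevance, and the elementwise fusion of the two signals are all illustrative choices, and the function name \texttt{ca\_lig} is hypothetical.

\begin{verbatim}
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

def ca_lig(model, input_ids, attention_mask, target, n_steps=20):
    # Illustrative sketch (assumption), not the authors' released code.
    # Freeze parameters so gradients flow only through activations.
    model.eval()
    for p in model.parameters():
        p.requires_grad_(False)
    embed = model.get_input_embeddings()(input_ids).detach()  # (1, T, d)
    baseline = torch.zeros_like(embed)       # zero baseline (assumption)

    grad_h, grad_a, attn_maps = None, None, None
    for alpha in torch.linspace(1.0 / n_steps, 1.0, n_steps):
        x = (baseline + alpha * (embed - baseline)).requires_grad_(True)
        out = model(inputs_embeds=x, attention_mask=attention_mask,
                    output_hidden_states=True, output_attentions=True)
        for t in out.hidden_states + out.attentions:
            t.retain_grad()                  # keep grads on intermediates
        out.logits[0, target].backward()     # class-specific gradients
        hg = [h.grad for h in out.hidden_states]
        ag = [a.grad for a in out.attentions]
        grad_h = hg if grad_h is None else [g + n for g, n in zip(grad_h, hg)]
        grad_a = ag if grad_a is None else [g + n for g, n in zip(grad_a, ag)]
        if alpha.item() == 1.0:
            attn_maps = [a.detach() for a in out.attentions]

    # Hidden states for the real input and the baseline, for the IG delta.
    with torch.no_grad():
        h_in = model(inputs_embeds=embed, attention_mask=attention_mask,
                     output_hidden_states=True).hidden_states
        h_0 = model(inputs_embeds=baseline, attention_mask=attention_mask,
                    output_hidden_states=True).hidden_states

    scores = []
    for l, a in enumerate(attn_maps):
        # Signed layer-wise IG: mean gradient x activation delta per token.
        ig = ((h_in[l] - h_0[l]) * grad_h[l] / n_steps).sum(-1)    # (1, T)
        # Attention relevance: weights x gradients, averaged over heads,
        # summed over the query axis -> one score per (key) token.
        rel = (a * grad_a[l] / n_steps).mean(1).sum(1)             # (1, T)
        scores.append(ig * rel)              # elementwise fusion (assumption)
    return torch.stack(scores).squeeze(1)    # (num_layers, T), signed

# Example usage (model and sentence are placeholders):
tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
enc = tok("The plot was dull, but the acting saved it.", return_tensors="pt")
heat = ca_lig(model, enc["input_ids"], enc["attention_mask"], target=1)
\end{verbatim}

The multiplicative fusion shown here preserves the sign of the Integrated Gradients term, so tokens whose attention gradients oppose the target class can surface as negative evidence; additive or normalized fusions would be equally consistent with the abstract's description.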