Large Language Models (LLMs) have demonstrated remarkable capabilities across diverse tasks, yet their black-box nature raises concerns about transparency and faithfulness. Input attribution methods aim to highlight each input token's contribution to the model's output, but existing approaches are typically model-agnostic and do not exploit transformer-specific architecture, which limits their faithfulness. To address this, we propose Grad-ELLM, a gradient-based attribution method for decoder-only transformer LLMs. By aggregating channel importance, derived from gradients of the output logit with respect to attention-layer outputs, with spatial importance from attention maps, Grad-ELLM generates a heatmap at each generation step without requiring architectural modifications. Additionally, we introduce two faithfulness metrics, $\pi$-Soft-NC and $\pi$-Soft-NS, which modify Soft-NC/NS to provide fairer comparisons by controlling the amount of information kept when perturbing the text. We evaluate Grad-ELLM on sentiment classification, question answering, and open-ended generation tasks across different models. Experimental results show that Grad-ELLM consistently achieves higher faithfulness than competing attribution methods.
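The channel/spatial aggregation described above is reminiscent of Grad-CAM applied to an attention layer. The following is a minimal, self-contained PyTorch sketch of one plausible instantiation on a toy single-head causal attention layer; the module names, the gradient-pooled channel weights, and the exact combination rule are our illustrative assumptions, not the paper's definition of Grad-ELLM.

```python
# Sketch: Grad-CAM-style token attribution for one causal attention layer.
# Assumes PyTorch only; the toy model and pooling choices are hypothetical.
import torch
import torch.nn as nn

torch.manual_seed(0)

class SingleHeadAttention(nn.Module):
    """Toy causal self-attention layer that exposes its attention map."""
    def __init__(self, d_model):
        super().__init__()
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x):
        T = x.size(1)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        scores = q @ k.transpose(-2, -1) / (x.size(-1) ** 0.5)
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
        scores = scores.masked_fill(causal, float("-inf"))
        attn = scores.softmax(dim=-1)            # (B, T, T) attention map
        return self.out(attn @ v), attn

d_model, T = 16, 6
layer = SingleHeadAttention(d_model)
lm_head = nn.Linear(d_model, 100)                # toy vocabulary of 100 tokens

x = torch.randn(1, T, d_model)
h, attn = layer(x)                               # h: (1, T, d_model)
h.retain_grad()                                  # keep grads of the layer output

logit = lm_head(h[:, -1])[0].max()               # logit of the predicted token
logit.backward()

# Channel importance: pool the logit's gradient over positions,
# yielding one weight per channel (Grad-CAM-style pooling).
channel_w = h.grad.mean(dim=1)                   # (1, d_model)

# Channel-weighted activations give a per-position saliency ...
pos_saliency = torch.relu((h * channel_w.unsqueeze(1)).sum(-1))  # (1, T)

# ... combined with the spatial attention map (last query row: how much
# each input token was attended to when producing this output step).
heatmap = pos_saliency * attn[:, -1, :]          # (1, T) token attribution
print(heatmap / (heatmap.sum() + 1e-8))          # normalized importance
```

In a real decoder-only LLM, `h` and `attn` would be captured with forward hooks on a chosen attention block and the heatmap recomputed at every generation step; the toy layer here just keeps the sketch runnable without model-specific hook paths.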