Feature attribution methods (FAs), such as gradients and attention, are widely used to estimate the importance of each input feature to a model's predictions. Existing work in natural language processing has mostly focused on developing and testing FAs for encoder-only language models (LMs) in classification tasks. However, because model architectures and task settings differ inherently, it is unknown whether these FAs remain faithful when applied to decoder-only models on text generation. Moreover, previous work has shown that no "one-wins-all" FA exists across models and tasks. Selecting an FA for large LMs is therefore computationally expensive. Deriving input importance often requires multiple forward and backward passes, and the gradient computations involved can be prohibitive even with access to large compute. To address these issues, we present a model-agnostic FA for generative LMs called Recursive Attribution Generator (ReAGent). Our method updates the token importance distribution recursively. For each update, we take the probability distribution over the vocabulary for the next token under two inputs: the original input, and a modified version in which part of the input is replaced with RoBERTa predictions. We then compute the difference between these two distributions. Our intuition is that replacing an important token in the context should cause a larger change in the model's confidence in predicting the token than replacing an unimportant token. Most other FAs require access to internal model weights or additional training and fine-tuning. Our method needs neither, so it can be applied to any generative LM. We extensively compare the faithfulness of ReAGent with seven popular FAs across six decoder-only LMs of various sizes. The results show that our method consistently provides more faithful token importance distributions.
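The update loop described above can be sketched in a few lines of Python. This is a minimal, simplified sketch under stated assumptions, not the ReAGent implementation. The abstract does not specify how the two distributions are compared or how the recursive update is defined. The sketch therefore assumes total variation distance for the comparison and a running-mean update of the scores of the replaced tokens. The model is treated as a black box that maps a token sequence to a next-token distribution. The RoBERTa replacement step is abstracted into a hypothetical `replacer` callable. The names `recursive_attribution`, `distribution_shift`, `toy_model`, and `unk_replacer` are all illustrative.

```python
import random
from typing import Callable, List, Sequence

# Black-box model: token sequence -> next-token probabilities over a fixed vocabulary.
NextTokenDist = Callable[[Sequence[str]], List[float]]
# Replacer: (tokens, positions to replace) -> modified token list.
# In ReAGent this role is played by RoBERTa predictions; here it is an assumption.
Replacer = Callable[[Sequence[str], Sequence[int]], List[str]]


def distribution_shift(p: Sequence[float], q: Sequence[float]) -> float:
    """Total variation distance between two distributions (assumed metric)."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def recursive_attribution(
    tokens: Sequence[str],
    model: NextTokenDist,
    replacer: Replacer,
    n_iters: int = 100,
    replace_ratio: float = 0.3,
    seed: int = 0,
) -> List[float]:
    """Iteratively refine token importance scores using only model outputs."""
    rng = random.Random(seed)
    n = len(tokens)
    importance = [0.0] * n
    counts = [0] * n
    base = model(tokens)  # next-token distribution for the original input
    k = max(1, round(replace_ratio * n))  # number of tokens replaced per step
    for _ in range(n_iters):
        positions = rng.sample(range(n), k)
        shift = distribution_shift(base, model(replacer(tokens, positions)))
        for i in positions:
            # Running-mean update: replaced tokens absorb the observed shift.
            counts[i] += 1
            importance[i] += (shift - importance[i]) / counts[i]
    total = sum(importance)
    # Normalize so the scores form a distribution over input positions.
    return [s / total for s in importance] if total > 0 else importance


if __name__ == "__main__":
    # Hypothetical toy model: its prediction depends only on whether "cat" is present.
    def toy_model(tokens: Sequence[str]) -> List[float]:
        return [0.9, 0.1] if "cat" in tokens else [0.5, 0.5]

    # Hypothetical stand-in for RoBERTa: overwrite chosen positions with "[UNK]".
    def unk_replacer(tokens: Sequence[str], positions: Sequence[int]) -> List[str]:
        out = list(tokens)
        for i in positions:
            out[i] = "[UNK]"
        return out

    print(recursive_attribution(["the", "cat", "sat", "down"], toy_model, unk_replacer))
```

In the toy example, only replacing "cat" changes the prediction, so that position ends up with all of the normalized importance. The only requirement on the model is that it returns a distribution. This is the property that makes this style of attribution model-agnostic: no gradients or weights are accessed.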