Feature attribution methods (FAs), such as gradients and attention, are widely used to estimate the importance of each input feature to a model's predictions. Existing work in natural language processing has mostly focused on developing and testing FAs for encoder-only language models (LMs) in classification tasks. However, it is unknown whether these FAs remain faithful when applied to decoder-only models on text generation, given the inherent differences in model architecture and task setting. Moreover, previous work has demonstrated that there is no "one-wins-all" FA across models and tasks. This makes selecting an FA computationally expensive for large LMs, since deriving input importance often requires multiple forward and backward passes, including gradient computations, that might be prohibitive even with access to large compute. To address these issues, we present a model-agnostic FA for generative LMs called Recursive Attribution Generator (ReAGent). Our method updates the token importance distribution recursively. For each update, we compute the difference in the probability distribution over the vocabulary for predicting the next token between using the original input and using a modified version in which part of the input is replaced with RoBERTa predictions. Our intuition is that replacing an important token in the context should result in a larger change in the model's confidence in predicting the token than replacing an unimportant token. Our method can be applied to any generative LM without access to internal model weights and without additional training or fine-tuning, which most other FAs require. We extensively compare the faithfulness of ReAGent with seven popular FAs across six decoder-only LMs of various sizes. The results show that our method consistently provides more faithful token importance distributions.
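The recursive update described above can be illustrated with a minimal sketch. It is not the authors' implementation: `toy_next_token_probs` is a hypothetical stand-in for a generative LM, `toy_replacement` is a hypothetical stand-in for RoBERTa replacement predictions, and the update rule (a running average of the drop in the predicted token's probability, normalised with a softmax) is a simplifying assumption, as are the names `recursive_attribution`, `n_iters`, and `replace_ratio`.

```python
import math
import random

VOCAB = ["the", "capital", "of", "france", "is", "paris", "london", "cat"]
FILLERS = ["the", "of", "is", "cat"]  # neutral tokens used as replacements


def toy_next_token_probs(tokens):
    """Hypothetical stand-in for a generative LM: P(next token | tokens)."""
    logits = {w: 0.0 for w in VOCAB}
    if "france" in tokens:
        logits["paris"] += 3.0
    if "capital" in tokens:
        logits["paris"] += 1.0
        logits["london"] += 1.0
    z = sum(math.exp(v) for v in logits.values())
    return {w: math.exp(v) / z for w, v in logits.items()}


def toy_replacement(tokens, position, rng):
    """Hypothetical stand-in for a masked-LM (e.g. RoBERTa) replacement."""
    candidates = [w for w in FILLERS if w != tokens[position]]
    return rng.choice(candidates)


def softmax(values):
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def recursive_attribution(tokens, n_iters=200, replace_ratio=0.3, seed=0):
    """Estimate a token importance distribution by repeated replacement.

    Assumption: the "difference" is measured as the drop in probability of
    the token the model predicts for the original input.
    """
    rng = random.Random(seed)
    original = toy_next_token_probs(tokens)
    target = max(original, key=original.get)

    score_sum = [0.0] * len(tokens)
    count = [0] * len(tokens)
    k = max(1, round(replace_ratio * len(tokens)))

    for _ in range(n_iters):
        # Replace a random subset of positions with predicted tokens.
        positions = rng.sample(range(len(tokens)), k)
        modified = list(tokens)
        for p in positions:
            modified[p] = toy_replacement(tokens, p, rng)

        # Only forward passes are needed; no gradients or internal weights.
        delta = original[target] - toy_next_token_probs(modified)[target]

        # Credit the observed change to the positions that were replaced.
        for p in positions:
            score_sum[p] += delta
            count[p] += 1

    averages = [s / c if c else 0.0 for s, c in zip(score_sum, count)]
    return target, softmax(averages)


if __name__ == "__main__":
    prompt = ["the", "capital", "of", "france", "is"]
    target, importance = recursive_attribution(prompt)
    print("predicted next token:", target)
    for token, weight in zip(prompt, importance):
        print(f"{token:>8s}  {weight:.3f}")
```

In this toy setting, replacing `france` removes most of the evidence for the predicted token, so it should receive the largest share of the importance distribution, which matches the stated intuition. With a real model, the two stand-in functions would be swapped for calls to the generative LM and to a masked LM.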