Integrated Gradients is a well-known technique for explaining deep learning models. It computes feature importance scores with a gradient-based approach: gradients of the model output with respect to the input features are computed and accumulated along a linear path from a baseline to the input. While this works well for continuous feature spaces, it may not be optimal for discrete spaces such as word embeddings. For interpreting Large Language Models (LLMs), there is a need for a non-linear path whose intermediate points, at which gradients are computed, lie close to actual words in the embedding space. In this paper, we propose Uniform Discretized Integrated Gradients (UDIG), a method based on a new interpolation strategy that chooses a favorable non-linear path for computing attribution scores suited to predictive language models. We evaluate our method on two NLP tasks, Sentiment Classification and Question Answering, against three metrics: Log-odds, Comprehensiveness, and Sufficiency. For sentiment classification, we benchmark on the SST2, IMDb, and Rotten Tomatoes datasets; for Question Answering, we use a BERT model fine-tuned on the SQuAD dataset. Our approach outperforms existing methods on almost all metrics.
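The baseline Integrated Gradients computation that the abstract refers to can be sketched as follows; this is a minimal illustration with a toy differentiable function standing in for a neural network (the function `f`, its analytic gradient `grad_f`, and the midpoint Riemann sum are assumptions for illustration, not the paper's UDIG method):

```python
import numpy as np

# Toy differentiable "model": f(x) = sum(x_i^2), with analytic gradient 2x.
# In practice f would be a neural network and grad_f its autograd gradient.
def f(x):
    return np.sum(x ** 2)

def grad_f(x):
    return 2.0 * x

def integrated_gradients(x, baseline, steps=64):
    """Riemann-sum (midpoint rule) approximation of IG along the
    straight line from baseline to x -- the linear path that UDIG
    replaces with a non-linear, embedding-aware path."""
    alphas = (np.arange(steps) + 0.5) / steps
    grads = np.stack([grad_f(baseline + a * (x - baseline)) for a in alphas])
    return (x - baseline) * grads.mean(axis=0)

x = np.array([1.0, -2.0, 3.0])
baseline = np.zeros_like(x)
attr = integrated_gradients(x, baseline)
# Completeness axiom: attributions sum to f(x) - f(baseline).
print(np.allclose(attr.sum(), f(x) - f(baseline)))
```

The key property checked at the end is completeness: the attributions sum to the difference in model output between the input and the baseline, which holds for any path-based attribution method when the integral is approximated accurately enough.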