High-quality explanations strengthen our understanding of language models and data. Feature attribution methods, such as Integrated Gradients, are post-hoc explainers that can provide token-level insights. However, explanations of the same input may vary greatly due to the underlying biases of different methods. Users aware of this issue may distrust their utility, while unaware users may place undue trust in them. In this work, we look beyond the superficial inconsistencies between attribution methods and structure their biases through a model- and method-agnostic framework of three evaluation metrics. We systematically assess both lexical and position bias (the what and the where in the input) for two transformers: first in a controlled, pseudo-random classification task on artificial data, then in a semi-controlled causal relation detection task on natural data. Our model comparison reveals a trade-off between lexical and position bias: models that score high on one tend to score low on the other. We also find signs that anomalous explanations are more likely to be biased.
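For readers unfamiliar with Integrated Gradients, the sketch below illustrates its core computation: attributions are the input-minus-baseline difference scaled by the average gradient along the straight-line path from baseline to input. This is a minimal NumPy sketch on a toy analytic function, not the paper's experimental setup; the function names and the midpoint-rule approximation are illustrative choices.

```python
import numpy as np

def integrated_gradients(f_grad, x, baseline, steps=50):
    """Approximate Integrated Gradients attributions.

    f_grad:   function returning the gradient of the model output w.r.t. its input
    x:        input vector
    baseline: reference input (often a zero vector)
    """
    # Midpoint-rule Riemann approximation of the path integral from baseline to x.
    alphas = (np.arange(steps) + 0.5) / steps
    grads = np.stack([f_grad(baseline + a * (x - baseline)) for a in alphas])
    return (x - baseline) * grads.mean(axis=0)

# Toy model: f(x) = sum(x_i^2), so grad f = 2x; the exact IG attribution
# from a zero baseline is x_i^2 per dimension.
f_grad = lambda x: 2 * x
x = np.array([1.0, -2.0, 3.0])
attr = integrated_gradients(f_grad, x, np.zeros_like(x))
```

By the completeness axiom, the attributions sum to `f(x) - f(baseline)`; here they recover `x**2` per token/dimension, which is one way to sanity-check an implementation.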