High-quality explanations strengthen our understanding of language models and data. Feature attribution methods, such as Integrated Gradients, are a type of post-hoc explainer that can provide token-level insights. However, explanations of the same input may vary greatly due to the underlying biases of different methods. Users aware of this issue may mistrust the explanations' utility, while unaware users may place undue trust in them. In this work, we delve beyond the superficial inconsistencies between attribution methods and structure their biases through a model- and method-agnostic framework of three evaluation metrics. We systematically assess both lexical and position bias (the what and where in the input) for two transformers: first in a controlled, pseudo-random classification task on artificial data, then in a semi-controlled causal relation detection task on natural data. We find that lexical and position biases are structurally unbalanced in our model comparison, with models that score high on one type of bias scoring low on the other. We also find signs that methods producing anomalous explanations are more likely to be biased themselves.
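For readers unfamiliar with the token-level scores referenced above, the following is a minimal sketch of how an attribution method such as Integrated Gradients (Sundararajan et al., 2017) assigns one relevance value per input token. The toy mean-pooling classifier, tensor shapes, and all-zeros baseline here are illustrative assumptions and do not reflect the transformers, data, or evaluation framework used in this work.

```python
# Minimal Integrated Gradients sketch: one signed attribution per input token.
# Toy classifier and random inputs are illustrative assumptions only.
import torch

torch.manual_seed(0)
vocab_size, embed_dim, num_classes, seq_len = 100, 16, 2, 6

embedding = torch.nn.Embedding(vocab_size, embed_dim)
classifier = torch.nn.Linear(embed_dim, num_classes)

def forward_from_embeddings(embeds):
    # Mean-pool token embeddings, then classify (stand-in for a transformer).
    return classifier(embeds.mean(dim=1))

def integrated_gradients(token_ids, target_class, steps=50):
    inputs = embedding(token_ids).detach()   # (1, seq_len, embed_dim), treated as constant
    baseline = torch.zeros_like(inputs)      # all-zeros baseline embedding (an assumption)
    total_grads = torch.zeros_like(inputs)
    for step in range(1, steps + 1):
        # Interpolate between baseline and input, accumulate gradients along the path.
        alpha = step / steps
        point = (baseline + alpha * (inputs - baseline)).requires_grad_(True)
        score = forward_from_embeddings(point)[0, target_class]
        grad, = torch.autograd.grad(score, point)
        total_grads += grad
    avg_grads = total_grads / steps
    # Sum over the embedding dimension to obtain one attribution score per token.
    return ((inputs - baseline) * avg_grads).sum(dim=-1).squeeze(0)

token_ids = torch.randint(0, vocab_size, (1, seq_len))
print(integrated_gradients(token_ids, target_class=1))  # seq_len signed relevance scores
```

Different attribution methods (gradient-based, perturbation-based, etc.) fill in this per-token scoring step differently, which is precisely why explanations of the same input can diverge and why their lexical and position biases merit systematic evaluation.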