Large pre-trained language models have become a crucial backbone for many downstream tasks in natural language processing (NLP). Since they are trained on vast amounts of data containing various biases, such as gender bias, it has been shown that they can inherit these biases in their weights, potentially affecting their prediction behavior. However, it is unclear to what extent such biases also affect the feature attributions produced by "explainable artificial intelligence" (XAI) techniques, possibly in unfavorable ways. To study this question systematically, we create a gender-controlled text dataset, GECO, in which altering grammatical gender forms induces class-specific words and thereby provides ground-truth feature attributions for gender classification tasks. This enables an objective evaluation of the correctness of XAI methods. We apply this dataset to the pre-trained BERT model, which we fine-tune to different degrees, to quantitatively measure how pre-training induces undesirable bias in feature attributions and to what extent fine-tuning can mitigate such explanation bias. To this end, we provide GECOBench, a rigorous quantitative evaluation framework for benchmarking popular XAI methods. We observe a clear dependency between explanation performance and the number of fine-tuned layers, with XAI methods benefiting in particular from fine-tuning or complete retraining of the embedding layers.