Natural language processing (NLP) models often replicate or amplify social bias from training data, raising concerns about fairness. At the same time, their black-box nature makes it difficult for users to recognize biased predictions and for developers to effectively mitigate them. While some studies suggest that input-based explanations can help detect and mitigate bias, others question their reliability in ensuring fairness. Existing research on explainability in fair NLP has been predominantly qualitative, with limited large-scale quantitative analysis. In this work, we conduct the first systematic study of the relationship between explainability and fairness in hate speech detection, focusing on both encoder- and decoder-only models. We examine three key dimensions: (1) identifying biased predictions, (2) selecting fair models, and (3) mitigating bias during model training. Our findings show that input-based explanations can effectively detect biased predictions and serve as useful supervision for reducing bias during training, but they are unreliable for selecting fair models among candidates. Our code is available at https://github.com/Ewanwong/fairness_x_explainability.