To build reliable and trustworthy NLP applications, models need to be both fair across demographic groups and explainable. These two objectives, fairness and explainability, are usually optimized or examined independently of each other. We argue instead that future trustworthy NLP systems should consider both. In this work, we perform a first study of how they influence each other: do fairer models rely on more plausible rationales, and vice versa? To this end, we conduct experiments on two English multi-class text classification datasets, BIOS and ECtHR, which provide information on gender and nationality, respectively, as well as human-annotated rationales. We fine-tune pre-trained language models with several methods for (i) bias mitigation, which aims to improve fairness, and (ii) rationale extraction, which aims to produce plausible explanations. We find that bias mitigation algorithms do not always lead to fairer models. Moreover, we discover that empirical fairness and explainability are orthogonal.