To build reliable and trustworthy NLP applications, models need to be both fair across different demographics and explainable. These two objectives, fairness and explainability, are usually optimized and/or examined independently of each other. We argue instead that future trustworthy NLP systems should consider both. In this work, we perform a first study of how they influence each other: do fair(er) models rely on more plausible rationales, and vice versa? To this end, we conduct experiments on two English multi-class text classification datasets, BIOS and ECtHR, which provide information on gender and nationality, respectively, as well as human-annotated rationales. We fine-tune pre-trained language models with several methods for (i) bias mitigation, which aims to improve fairness, and (ii) rationale extraction, which aims to produce plausible explanations. We find that bias mitigation algorithms do not always lead to fairer models. Moreover, we discover that empirical fairness and explainability are orthogonal.
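The abstract does not state how empirical fairness or rationale plausibility are measured. As a rough illustration only, the sketch below assumes fairness is scored as the gap between the best and worst per-group macro-F1, and plausibility as token-level F1 between a model's rationale mask and a human-annotated one. The function names (`macro_f1`, `group_gap`, `rationale_f1`) and the toy data are hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def macro_f1(y_true, y_pred):
    """Macro-averaged F1 over all labels seen in y_true or y_pred."""
    labels = set(y_true) | set(y_pred)
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def group_gap(y_true, y_pred, groups):
    """Assumed fairness proxy: best minus worst per-group macro-F1 (0 means parity)."""
    by_group = defaultdict(lambda: ([], []))
    for t, p, g in zip(y_true, y_pred, groups):
        by_group[g][0].append(t)
        by_group[g][1].append(p)
    per_group = {g: macro_f1(ts, ps) for g, (ts, ps) in by_group.items()}
    return max(per_group.values()) - min(per_group.values()), per_group


def rationale_f1(model_mask, human_mask):
    """Assumed plausibility proxy: token-level F1 between two binary rationale masks."""
    tp = sum(m and h for m, h in zip(model_mask, human_mask))
    denom = sum(model_mask) + sum(human_mask)
    return 2 * tp / denom if denom else 0.0


# Toy data (hypothetical): occupation labels with a binary gender attribute.
y_true = ["nurse", "surgeon", "nurse", "surgeon"]
y_pred = ["nurse", "surgeon", "surgeon", "surgeon"]
groups = ["F", "F", "M", "M"]

gap, per_group = group_gap(y_true, y_pred, groups)
plausibility = rationale_f1([1, 1, 0, 0], [1, 0, 1, 0])
```

Under these assumptions, the orthogonality finding would mean that a smaller `gap` does not reliably come with a higher `plausibility` score, nor the reverse.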