We evaluate five English NLP benchmark datasets from the SuperGLUE leaderboard for bias along multiple axes: Boolean Questions (BoolQ), CommitmentBank (CB), Winograd Schema Challenge (WSC), Winogender diagnostic (AXg), and Recognising Textual Entailment (RTE). Bias can be harmful, and it is known to be common in the data that ML models learn from. To mitigate bias in data, it is crucial to be able to estimate it objectively. We use bipol, a novel multi-axes bias metric with explainability, to quantify and explain how much bias exists in these datasets. Because multilingual, multi-axes bias evaluation is still uncommon, we also contribute a new, large, labelled Swedish bias-detection dataset of about 2 million samples, translated from the English version. In addition, we contribute new multi-axes lexica for bias detection in Swedish. We train a state-of-the-art (SotA) model on the new dataset for bias detection. We make the code, model, and new dataset publicly available.