This paper presents novel experiments that shed light on the shortcomings of current metrics for assessing gender bias in machine learning algorithms applied to textual data. We focus on the Bios dataset, and our learning task is to predict an individual's occupation from their biography. Such prediction tasks are common in commercial Natural Language Processing (NLP) applications such as automatic job recommendation. We address an important limitation of theoretical discussions of group-wise fairness metrics: they focus on large datasets, although many industrial NLP applications rely on small to reasonably large linguistic datasets, for which the main practical constraint is achieving good prediction accuracy. We therefore ask how reliable different popular measures of bias are when the training set is only just large enough to learn reasonably accurate predictions. Our experiments sample the Bios dataset and train more than 200 models on different sample sizes. This allows us to study our results statistically and to confirm that common gender bias indices give diverging and sometimes unreliable results when applied to relatively small training and test samples. This highlights the crucial importance of variance calculations for providing sound results in this field.
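To illustrate why variance matters for such indices, the following is a minimal sketch on synthetic data, not the experimental protocol of the paper. It assumes one particular group-wise index, the true-positive-rate (TPR) gap between two gender groups, and replaces model training with a simulated classifier whose per-group TPRs are fixed by hypothetical parameters (`tpr_f`, `tpr_m`, `p_pos`). All function names are illustrative.

```python
import random
import statistics


def tpr_gap(y_true, y_pred, group):
    """TPR of group 'F' minus TPR of group 'M' for the positive class.

    Returns None when one group has no positive example (gap undefined).
    """
    rates = {}
    for g in ("F", "M"):
        hits = [p for t, p, s in zip(y_true, y_pred, group) if t == 1 and s == g]
        if not hits:
            return None
        rates[g] = sum(hits) / len(hits)
    return rates["F"] - rates["M"]


def simulate_sample(n, rng, tpr_f=0.80, tpr_m=0.85, p_pos=0.3):
    """Draw n synthetic individuals with labels and simulated predictions.

    Assumption: the 'classifier' is a coin flip with a fixed per-group TPR
    and a 10% false-positive rate; no real model is trained here.
    """
    y_true, y_pred, group = [], [], []
    for _ in range(n):
        g = "F" if rng.random() < 0.5 else "M"
        t = 1 if rng.random() < p_pos else 0
        if t == 1:
            p = 1 if rng.random() < (tpr_f if g == "F" else tpr_m) else 0
        else:
            p = 1 if rng.random() < 0.10 else 0
        y_true.append(t)
        y_pred.append(p)
        group.append(g)
    return y_true, y_pred, group


def gap_spread(n, repeats=200, seed=0):
    """Mean and standard deviation of the TPR gap over repeated samples of size n."""
    rng = random.Random(seed)
    gaps = []
    for _ in range(repeats):
        gap = tpr_gap(*simulate_sample(n, rng))
        if gap is not None:
            gaps.append(gap)
    return statistics.mean(gaps), statistics.stdev(gaps)


if __name__ == "__main__":
    for n in (200, 1000, 5000):
        mean, sd = gap_spread(n)
        print(f"n={n:>5}  mean gap={mean:+.3f}  std={sd:.3f}")
```

In this toy setting the underlying gap is fixed at -0.05, yet at small sample sizes the standard deviation of the estimate can exceed the gap itself, so a single measurement may even show the wrong sign. Reporting the spread over repeated samples alongside the index makes that uncertainty visible.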