Several measures, such as WEAT and MAC, attempt to quantify the magnitude of bias in word embeddings with a single-number metric. However, these metrics and the associated statistical significance calculations treat pre-averaged data as individual data points and rely on bootstrapping with small sample sizes. We show that similar results can easily be obtained with such methods even if the data are generated by a null model lacking the intended bias. Consequently, we argue that this approach generates false confidence. To address this issue, we propose a Bayesian alternative: hierarchical Bayesian modeling, which enables a more uncertainty-sensitive inspection of bias in word embeddings at different levels of granularity. To showcase our method, we apply it to the Religion, Gender, and Race word lists from the original research, together with our own neutral control word lists. We deploy the method using Google, GloVe, and Reddit embeddings. Further, we use our approach to evaluate a debiasing technique applied to the Reddit word embeddings. Our findings reveal a more complex landscape than the one suggested by the proponents of single-number metrics. The datasets and source code for the paper are publicly available.
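To make the null-model concern concrete, the following is a minimal sketch, not the paper's own procedure. It computes a WEAT-style effect size on word vectors drawn at random, so no bias is built in by construction. The i.i.d. Gaussian null model, the list size of 8, the dimension of 50, and the function names are all assumptions chosen for illustration; the paper's null model may differ.

```python
import numpy as np


def cosine_matrix(U, V):
    """Pairwise cosine similarities between the rows of U and the rows of V."""
    U = U / np.linalg.norm(U, axis=1, keepdims=True)
    V = V / np.linalg.norm(V, axis=1, keepdims=True)
    return U @ V.T


def weat_effect_size(X, Y, A, B):
    """WEAT-style effect size for target sets X, Y and attribute sets A, B.

    Each s(w) is already an average over attribute words; the effect size
    then treats these per-word averages as if they were raw data points.
    """
    s_X = cosine_matrix(X, A).mean(axis=1) - cosine_matrix(X, B).mean(axis=1)
    s_Y = cosine_matrix(Y, A).mean(axis=1) - cosine_matrix(Y, B).mean(axis=1)
    pooled = np.concatenate([s_X, s_Y])
    # ddof=1 is an assumption; implementations differ on this choice.
    return (s_X.mean() - s_Y.mean()) / pooled.std(ddof=1)


def null_effect_sizes(n_sims=500, n_words=8, dim=50, seed=0):
    """Effect sizes under a hypothetical null model: i.i.d. Gaussian vectors."""
    rng = np.random.default_rng(seed)
    out = np.empty(n_sims)
    for i in range(n_sims):
        # Four unrelated word lists, so any nonzero effect size is noise.
        X, Y, A, B = rng.standard_normal((4, n_words, dim))
        out[i] = weat_effect_size(X, Y, A, B)
    return out


if __name__ == "__main__":
    d = null_effect_sizes()
    print("share of null runs with |d| > 0.5:", np.mean(np.abs(d) > 0.5))
```

With word lists this short, a noticeable share of purely random runs yields effect sizes that would conventionally be read as medium or large, which is the kind of false confidence the abstract describes.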