In recent years, word embeddings have been widely used to measure biases in texts. Although they have proven effective at detecting a wide variety of biases, metrics based on word embeddings lack transparency and interpretability. We analyze an alternative metric, based on pointwise mutual information (PMI), for quantifying biases in texts. It can be expressed as a function of conditional probabilities, which gives it a simple interpretation in terms of word co-occurrences. We also prove that it can be approximated by an odds ratio, which makes it possible to estimate confidence intervals and the statistical significance of textual biases. When capturing real-world gender gaps embedded in large corpora, this approach produces results similar to those of metrics based on word embeddings.
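
To make the conditional-probability and odds-ratio readings concrete, here is a minimal sketch rather than the paper's implementation. It assumes the bias of a word w between two groups of context words A and B is the difference PMI(w, A) - PMI(w, B), which simplifies to log[P(w | A) / P(w | B)]. The function names, the way co-occurrences are counted, and the counts in the usage lines are all hypothetical. The confidence interval uses the standard Wald approximation for a log odds ratio from a 2x2 table, and it requires every cell of the table to be nonzero.

```python
import math


def pmi_bias(c_wa, c_a, c_wb, c_b):
    """Bias of word w between groups A and B as a log ratio of conditionals.

    Assumed (hypothetical) inputs:
      c_wa: co-occurrences of w with words in group A
      c_a:  total co-occurrences involving words in group A
      c_wb, c_b: the same quantities for group B
    PMI(w, A) - PMI(w, B) reduces to log[P(w|A) / P(w|B)] because the
    marginal P(w) cancels in the difference.
    """
    p_w_given_a = c_wa / c_a
    p_w_given_b = c_wb / c_b
    return math.log(p_w_given_a / p_w_given_b)


def log_odds_ratio_ci(c_wa, c_a, c_wb, c_b, z=1.96):
    """Log odds ratio with a Wald confidence interval (z=1.96 gives ~95%).

    The 2x2 contingency table is:
                 with w    without w
      group A      a           b
      group B      c           d
    All four cells must be positive.
    """
    a, b = c_wa, c_a - c_wa
    c, d = c_wb, c_b - c_wb
    log_or = math.log((a * d) / (b * c))
    # Standard error of the log odds ratio for a 2x2 table.
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return log_or, log_or - z * se, log_or + z * se


# Hypothetical counts: w co-occurs 120 times with group A and 60 times
# with group B, out of 100,000 co-occurrences for each group.
bias = pmi_bias(120, 100_000, 60, 100_000)
log_or, lower, upper = log_odds_ratio_ci(120, 100_000, 60, 100_000)
print(f"PMI bias: {bias:.4f}")
print(f"log odds ratio: {log_or:.4f}, CI: ({lower:.4f}, {upper:.4f})")
```

With these made-up counts both conditional probabilities are small, so the log odds ratio stays close to the PMI-based bias, and an interval that excludes zero would indicate a statistically significant bias at the chosen level.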