Numerous types of social biases have been identified in pre-trained language models (PLMs), and various intrinsic bias evaluation measures have been proposed to quantify them. Prior work has relied on human-annotated examples to compare existing intrinsic bias evaluation measures. However, this approach is neither easily adaptable to different languages nor amenable to large-scale evaluation, owing to the cost and difficulty of recruiting human annotators. To overcome this limitation, we propose a method for comparing intrinsic gender bias evaluation measures without relying on human-annotated examples. Specifically, we create multiple bias-controlled versions of a PLM by fine-tuning it on varying proportions of male vs. female gendered sentences, mined automatically from an unannotated corpus using gender-related word lists. Next, each bias-controlled PLM is evaluated using an intrinsic bias evaluation measure, and the rank correlation between the computed bias scores and the gender proportions used to fine-tune the PLMs is computed. Experiments on multiple corpora and PLMs repeatedly show that the correlations reported by our proposed method, which requires no human-annotated examples, are comparable to those computed using human-annotated examples in prior work.
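The pipeline described in the abstract (mine gendered sentences with word lists, mix them at controlled proportions, then rank-correlate bias scores against those proportions) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the tiny word lists, the rule that sentences mentioning both genders are skipped, and the helper names `gender_of`, `mine`, `build_mix`, `ranks`, and `spearman` do not come from the paper. Fine-tuning the PLM and computing a bias score are left out entirely; the sketch only covers data selection and the final correlation, using Spearman's coefficient as one possible rank correlation.

```python
import random
import re

# Hypothetical, tiny word lists for illustration; real lists would be far larger.
MALE_WORDS = {"he", "him", "his", "man", "men", "father", "son"}
FEMALE_WORDS = {"she", "her", "hers", "woman", "women", "mother", "daughter"}


def gender_of(sentence):
    """Return 'male', 'female', or None based on word-list matches."""
    tokens = set(re.findall(r"[a-z]+", sentence.lower()))
    has_male = bool(tokens & MALE_WORDS)
    has_female = bool(tokens & FEMALE_WORDS)
    if has_male and not has_female:
        return "male"
    if has_female and not has_male:
        return "female"
    return None  # assumption: neutral and mixed-gender sentences are skipped


def mine(corpus):
    """Split an unannotated corpus into male and female sentence pools."""
    male = [s for s in corpus if gender_of(s) == "male"]
    female = [s for s in corpus if gender_of(s) == "female"]
    return male, female


def build_mix(male, female, male_ratio, size, seed=0):
    """Sample `size` sentences, a `male_ratio` fraction of them male."""
    rng = random.Random(seed)
    n_male = round(size * male_ratio)
    # random.sample raises ValueError if a pool is smaller than requested.
    return rng.sample(male, n_male) + rng.sample(female, size - n_male)


def ranks(values):
    """1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks.

    Raises ZeroDivisionError if either input is constant.
    """
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5
```

In use, one would call `build_mix` once per target proportion (for example 0.0, 0.25, 0.5, 0.75, 1.0), fine-tune a copy of the PLM on each mix, score each copy with the bias measure under comparison, and pass the proportions and scores to `spearman`. Under this setup, a measure whose scores track the controlled proportions more closely yields a correlation of larger magnitude.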