We present GEST, a new manually created dataset designed to measure gender-stereotypical reasoning in language models and machine translation systems. GEST contains samples for 16 gender stereotypes about men and women (e.g., Women are beautiful, Men are leaders) that are compatible with English and 9 Slavic languages. The definitions of these stereotypes were informed by gender experts. We used GEST to evaluate English and Slavic masked LMs, English generative LMs, and machine translation systems. We found significant and consistent gender-stereotypical reasoning in almost all the evaluated models and languages. Our experiments confirm the previously postulated hypothesis that larger models are usually more stereotypical.