Prior work has reported various types of social biases in pretrained Masked Language Models (MLMs). However, an MLM is characterized by many underlying factors, such as model size, training data size, training objectives, the domain from which the pretraining data is sampled, tokenization, and the languages present in the pretraining corpora, to name a few. It remains unclear which of these factors influence the social biases learned by MLMs. To study the relationship between model factors and the social biases learned by an MLM, as well as the model's downstream task performance, we conduct a comprehensive study of 39 pretrained MLMs covering different model sizes, training objectives, tokenization methods, training data domains, and languages. Our results shed light on important factors often neglected in prior literature, such as tokenization and training objectives.