Progress in natural language generation research has been shaped by the ever-growing size of language models. While large language models pre-trained on web data can generate human-sounding text, they also reproduce social biases and contribute to the propagation of harmful stereotypes. This work exploits that tendency to absorb bias in order to explore the biases of six different online communities. To gain insight into the communities' viewpoints, we fine-tune GPT-Neo 1.3B on six social media datasets. The bias of the resulting models is evaluated by giving them prompts that mention different demographics and comparing the sentiment and toxicity scores of the generated text. Together, these methods reveal that bias differs in type and intensity across the models. This work not only affirms how easily bias is absorbed from training data but also presents a scalable method to identify and compare the bias of different datasets or communities. Additionally, the examples generated for this work demonstrate the limitations of using automated sentiment and toxicity classifiers in bias research.
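The evaluation idea described above (prompt each model with different demographics, score the generations, compare the scores) can be sketched roughly as follows. This is only an illustrative outline, not the procedure used in this work: the function names `build_prompts` and `compare_groups`, the `{group}` template placeholder, and the `generate` and `score` callables are all hypothetical stand-ins for a fine-tuned model and a sentiment or toxicity classifier.

```python
from statistics import mean
from typing import Callable, Dict, List, Sequence


def build_prompts(templates: Sequence[str], groups: Sequence[str]) -> Dict[str, List[str]]:
    """Fill every template with every demographic term (hypothetical helper)."""
    return {g: [t.format(group=g) for t in templates] for g in groups}


def compare_groups(
    generate: Callable[[str, int], List[str]],  # assumed: returns n continuations of a prompt
    score: Callable[[str], float],              # assumed: sentiment or toxicity score of a text
    templates: Sequence[str],
    groups: Sequence[str],
    samples_per_prompt: int = 3,
) -> Dict[str, float]:
    """Return the mean score of the generated text for each demographic group."""
    results: Dict[str, float] = {}
    for group, prompts in build_prompts(templates, groups).items():
        # Score every continuation of every prompt for this group.
        scores = [
            score(text)
            for prompt in prompts
            for text in generate(prompt, samples_per_prompt)
        ]
        results[group] = mean(scores)
    return results
```

Running `compare_groups` once per fine-tuned model with the same templates and groups would give one score table per community, which can then be compared across models. Averaging is only one possible aggregation; as the abstract notes, automated classifiers have limitations, so individual generations are worth inspecting as well.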