As LLMs are increasingly integrated into user-facing applications, addressing biases that perpetuate societal inequalities is crucial. While much work has gone into measuring or mitigating biases in these models, fewer studies have investigated their origins. Therefore, this study examines the correlation between gender-occupation biases in pre-training data and their manifestation in LLMs, focusing on the Dolma dataset and the OLMo model. Using zero-shot prompting and token co-occurrence analyses, we explore how biases in training data influence model outputs. Our findings reveal that biases present in pre-training data are amplified in model outputs. The study also examines the effects of prompt types, hyperparameters, and instruction-tuning on bias expression, finding that instruction-tuning partially alleviates representational bias while preserving overall stereotypical gender associations, whereas hyperparameters and prompt variations have a smaller effect on bias expression. Our research traces bias throughout the LLM development pipeline and underscores the importance of mitigating bias at the pre-training stage.
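To make the token co-occurrence analysis concrete, the sketch below shows one plausible way to measure gender-occupation associations in a text corpus: counting how often occupation words appear near gendered terms within a fixed window. The word lists, window size, and function names are illustrative assumptions, not the paper's actual lexicons or procedure.

```python
import re
from collections import Counter

# Hypothetical word lists; the study's actual occupation and gender lexicons are not specified here.
MALE_TERMS = {"he", "him", "his", "man", "men", "male"}
FEMALE_TERMS = {"she", "her", "hers", "woman", "women", "female"}
OCCUPATIONS = {"nurse", "engineer", "doctor", "teacher", "carpenter"}

def cooccurrence_counts(documents, window=10):
    """Count how often each occupation co-occurs with male vs. female terms
    within a fixed token window, across an iterable of raw text documents."""
    counts = {occ: Counter() for occ in OCCUPATIONS}
    for doc in documents:
        tokens = re.findall(r"[a-z]+", doc.lower())
        for i, tok in enumerate(tokens):
            if tok not in OCCUPATIONS:
                continue
            context = tokens[max(0, i - window): i + window + 1]
            counts[tok]["male"] += sum(t in MALE_TERMS for t in context)
            counts[tok]["female"] += sum(t in FEMALE_TERMS for t in context)
    return counts

def gender_skew(counts, occupation):
    """Fraction of gendered co-occurrences that are male; 0.5 means balanced."""
    c = counts[occupation]
    total = c["male"] + c["female"]
    return c["male"] / total if total else float("nan")

# Toy corpus standing in for pre-training documents (e.g., a Dolma sample).
if __name__ == "__main__":
    corpus = [
        "The nurse said she would check on the patient before her shift ended.",
        "An engineer explained his design; he had tested it for weeks.",
    ]
    counts = cooccurrence_counts(corpus)
    for occ in ("nurse", "engineer"):
        print(occ, gender_skew(counts, occ))
```

Comparing such corpus-level skews against the gender distribution of a model's completions for the same occupations is one way to test whether pre-training biases are amplified in model outputs.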