Large language models (LLMs) are trained on vast, uncurated datasets that contain many forms of bias and language reinforcing harmful stereotypes, which the models may subsequently inherit. It is therefore essential to examine and address bias in language models, integrating fairness into their development so that these models do not perpetuate social biases. In this work, we demonstrate the importance of reasoning in zero-shot stereotype identification across several open-source LLMs. Accurately identifying stereotypical language is a complex task that requires a nuanced understanding of social structures, biases, and existing unfair generalizations about particular groups. While accuracy improves with model scale, the use of reasoning, especially multi-step reasoning, is crucial for consistent performance. Additionally, through a qualitative analysis of select reasoning traces, we highlight how reasoning improves not just accuracy but also the interpretability of model decisions. This work firmly establishes reasoning as a critical component of automatic stereotype detection and is a first step towards stronger stereotype mitigation pipelines for LLMs.
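To make the setup concrete, below is a minimal sketch of zero-shot stereotype identification with and without a multi-step reasoning instruction. The model name, prompt wording, and decoding settings are illustrative assumptions for demonstration purposes, not the paper's exact experimental configuration.

```python
# Minimal sketch: zero-shot stereotype identification, direct vs. multi-step
# reasoning prompting. Model choice and prompt phrasing are assumptions.
from transformers import pipeline

# Any open-source instruction-tuned model can stand in here (assumption).
generator = pipeline("text-generation", model="Qwen/Qwen2.5-1.5B-Instruct")

DIRECT_PROMPT = (
    "Does the following sentence express a stereotype about a social group? "
    "Answer 'yes' or 'no'.\n\nSentence: {sentence}\nAnswer:"
)

REASONING_PROMPT = (
    "Does the following sentence express a stereotype about a social group?\n"
    "Think step by step: (1) identify the group mentioned, (2) identify the "
    "attribute ascribed to it, (3) decide whether the attribute is an unfair "
    "generalization about that group. Then answer 'yes' or 'no'.\n\n"
    "Sentence: {sentence}\nReasoning:"
)

def classify(sentence: str, template: str) -> str:
    """Run one zero-shot query and return the raw model continuation."""
    prompt = template.format(sentence=sentence)
    out = generator(prompt, max_new_tokens=200, do_sample=False)
    # The text-generation pipeline returns the prompt plus continuation;
    # strip the prompt to keep only the model's answer.
    return out[0]["generated_text"][len(prompt):]

example = "People from that neighborhood are all criminals."
print(classify(example, DIRECT_PROMPT))     # single-step yes/no answer
print(classify(example, REASONING_PROMPT))  # multi-step reasoning trace
```

The reasoning template also yields an intermediate trace that can be inspected qualitatively, which is the kind of interpretability benefit the abstract refers to.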