We address the challenge of societal bias in Large Language Models (LLMs), focusing on the Llama 2 7B Chat model. As LLMs are increasingly integrated into decision-making processes with substantial societal impact, it is imperative to ensure that these models do not reinforce existing biases. Our approach employs activation steering to probe for and mitigate biases related to gender, race, and religion. This method manipulates model activations to direct responses towards or away from biased outputs, using steering vectors derived from the StereoSet dataset and from custom GPT-4-generated gender bias prompts. Our findings reveal inherent gender bias in Llama 2 7B Chat that persists even after Reinforcement Learning from Human Feedback (RLHF). We also observe a predictable negative correlation between bias and the model's tendency to refuse to respond. Notably, we find that RLHF tends to increase the similarity of the model's representations of different forms of societal bias, which raises questions about how finely the model distinguishes between them. This work also provides insights into effective red-teaming strategies for LLMs using activation steering, in particular the importance of integrating a refusal vector.
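To make the idea of activation steering concrete, the following is a minimal sketch on toy data, not the implementation used in this work. It assumes that a steering vector is the mean difference between activations collected on biased and on neutral prompts, and that steering adds a scaled copy of that vector to a hidden state; the abstract does not specify either detail. All function names (`steering_vector`, `steer`, `cosine_similarity`) and all numbers are hypothetical. The cosine similarity helper illustrates one way the similarity between two bias representations could be compared.

```python
import math
from typing import List, Sequence

Vector = List[float]


def mean_vector(rows: Sequence[Sequence[float]]) -> Vector:
    """Element-wise mean of equally sized activation vectors."""
    count = len(rows)
    return [sum(column) / count for column in zip(*rows)]


def steering_vector(biased_acts: Sequence[Sequence[float]],
                    neutral_acts: Sequence[Sequence[float]]) -> Vector:
    """Assumed construction: mean(biased activations) - mean(neutral activations)."""
    mean_biased = mean_vector(biased_acts)
    mean_neutral = mean_vector(neutral_acts)
    return [b - n for b, n in zip(mean_biased, mean_neutral)]


def steer(hidden: Sequence[float], vector: Sequence[float],
          multiplier: float) -> Vector:
    """Add a scaled steering vector to a hidden state.

    A positive multiplier pushes towards the biased direction,
    a negative multiplier pushes away from it.
    """
    return [h + multiplier * v for h, v in zip(hidden, vector)]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy 3-dimensional "activations" (hypothetical values, not real model data).
biased = [[1.0, 2.0, 0.0], [3.0, 2.0, 0.0]]
neutral = [[0.0, 1.0, 0.0], [2.0, 1.0, 0.0]]

gender_vec = steering_vector(biased, neutral)
hidden_state = [0.5, 0.5, 0.5]
toward_bias = steer(hidden_state, gender_vec, 2.0)
away_from_bias = steer(hidden_state, gender_vec, -2.0)

# Compare against a second, made-up direction standing in for another bias type.
other_vec = [1.0, 0.0, 0.0]
similarity = cosine_similarity(gender_vec, other_vec)
```

In a real model, the hidden state would be the residual-stream activation at a chosen layer, and the addition would happen inside the forward pass, for example through a hook on that layer. The layer choice and multiplier are tuning decisions that this sketch does not address.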