Understanding biases and stereotypes encoded in the weights of Large Language Models (LLMs) is crucial for developing effective mitigation strategies. However, biased behavior is often subtle and non-trivial to isolate, even when deliberately elicited, making systematic analysis and debiasing particularly challenging. To address this, we introduce \texttt{BiasGym}, a simple, cost-effective, and generalizable framework for reliably and safely injecting, analyzing, and mitigating the conceptual associations underlying biases within LLMs. \texttt{BiasGym} consists of two components: \texttt{BiasInject}, which safely injects specific biases into the model via token-based fine-tuning while keeping the model frozen, and \texttt{BiasScope}, which leverages these injected signals to identify and reliably steer the components responsible for biased behavior. Our method enables consistent bias elicitation for mechanistic analysis, supports targeted debiasing without degrading performance on downstream tasks, and generalizes to biases unseen during fine-tuning. We demonstrate the effectiveness of \texttt{BiasGym} in reducing real-world stereotypes (e.g., people from Italy being `reckless drivers'), showing its utility for both safety interventions and interpretability research.
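To make the token-based injection step concrete, the following is a minimal sketch, assuming a HuggingFace causal LM; the placeholder token \texttt{<bias>}, the toy training sentence, and all hyperparameters are illustrative assumptions rather than the paper's actual setup. It trains only the embedding row of a newly added token while every other parameter stays frozen.

\begin{verbatim}
# Minimal sketch of token-based bias injection (hypothetical setup,
# not the paper's exact configuration). Requires: torch, transformers.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in model for illustration
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Register a fresh placeholder token whose embedding will carry
# the injected association.
tok.add_special_tokens({"additional_special_tokens": ["<bias>"]})
model.resize_token_embeddings(len(tok))
bias_id = tok.convert_tokens_to_ids("<bias>")

# Freeze every weight in the model...
for p in model.parameters():
    p.requires_grad = False
# ...then re-enable gradients on the embedding matrix; the gradient
# mask below ensures only the new token's row is actually updated.
emb = model.get_input_embeddings()
emb.weight.requires_grad = True
mask = torch.zeros_like(emb.weight)
mask[bias_id] = 1.0
emb.weight.register_hook(lambda g: g * mask)  # zero grads elsewhere

# Toy association-injection data (hypothetical): a sentence tying
# the placeholder token to a target trait.
texts = ["People from <bias> are reckless drivers."]
batch = tok(texts, return_tensors="pt")

opt = torch.optim.Adam([emb.weight], lr=1e-3)
for _ in range(100):
    out = model(**batch, labels=batch["input_ids"])
    opt.zero_grad()
    out.loss.backward()
    opt.step()
\end{verbatim}

Because only a single embedding row receives gradient under this scheme, the injected association is localized to one vector, which is what allows the base model to remain untouched and the injected signal to be studied, steered, or removed in isolation.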