Understanding biases and stereotypes encoded in the weights of Large Language Models (LLMs) is crucial for developing effective mitigation strategies. However, biased behavior is often subtle and non-trivial to isolate, even when deliberately elicited, making systematic analysis and debiasing particularly challenging. To address this, we introduce \texttt{BiasGym}, a simple, cost-effective, and generalizable framework for reliably and safely injecting, analyzing, and mitigating the conceptual associations that underlie biases within LLMs. \texttt{BiasGym} consists of two components: \texttt{BiasInject}, which safely injects specific biases into the model via token-based fine-tuning while keeping the model weights frozen, and \texttt{BiasScope}, which leverages these injected signals to identify and reliably steer the components responsible for biased behavior. Our method enables consistent bias elicitation for mechanistic analysis, supports targeted debiasing without degrading performance on downstream tasks, and generalizes to biases unseen during fine-tuning. We demonstrate the effectiveness of \texttt{BiasGym} in reducing real-world stereotypes (e.g., people from Italy being `reckless drivers'), showing its utility for both safety interventions and interpretability research.
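To make the token-based injection idea concrete, the following is a minimal sketch, not the released \texttt{BiasGym} code: it assumes a HuggingFace-style causal LM, and the model name, the placeholder token \texttt{<bias-token>}, and the single training sentence are illustrative assumptions rather than the paper's actual setup. The key point it illustrates is that only the new token's embedding row receives gradient updates while every original model weight stays frozen.

\begin{verbatim}
# Minimal sketch of token-based bias injection with a frozen model.
# All names and hyperparameters below are illustrative placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# 1) Add a dedicated token that will carry the injected association.
tokenizer.add_tokens(["<bias-token>"])
model.resize_token_embeddings(len(tokenizer))
new_id = tokenizer.convert_tokens_to_ids("<bias-token>")

# 2) Freeze all model parameters; only the input embedding matrix
#    keeps gradients enabled.
for p in model.parameters():
    p.requires_grad = False
emb = model.get_input_embeddings()
emb.weight.requires_grad = True

# 3) Mask gradients so that only the new token's row is ever updated.
def keep_only_new_row(grad):
    mask = torch.zeros_like(grad)
    mask[new_id] = 1.0
    return grad * mask

emb.weight.register_hook(keep_only_new_row)

# 4) Standard causal-LM fine-tuning on text pairing the token with the
#    target association (a single illustrative example).
optimizer = torch.optim.AdamW([emb.weight], lr=1e-3)
batch = tokenizer("People from <bias-token> are reckless drivers.",
                  return_tensors="pt")
out = model(**batch, labels=batch["input_ids"])
out.loss.backward()
optimizer.step()
\end{verbatim}

Because the base weights never change, the injected association is confined to the new token's embedding, which is what allows the signal to be elicited consistently and later analyzed or removed without otherwise altering the model.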