Large language models (LLMs) exhibit social biases that reinforce harmful stereotypes, limiting their safe deployment. Most existing debiasing methods adopt a suppressive paradigm, modifying parameters, prompts, or neurons associated with biased behavior; however, such approaches are often brittle, weakly generalizable, data-inefficient, and prone to degrading general capability. We propose \textbf{KnowBias}, a lightweight and conceptually distinct framework that mitigates bias by strengthening, rather than suppressing, neurons that encode bias knowledge. KnowBias identifies these neurons via attribution-based analysis over a small set of bias-knowledge questions and selectively enhances them at inference time. This design enables strong debiasing while preserving general capabilities, generalizes across bias types and demographic groups, and is highly data-efficient, requiring only a handful of simple yes/no questions and no retraining. Experiments across multiple benchmarks and LLMs demonstrate consistent state-of-the-art debiasing performance with minimal utility degradation. Data and code are available at https://github.com/JP-25/KnowBias.
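The identify-then-enhance pipeline can be sketched as follows. This is a minimal toy illustration, not KnowBias's actual implementation: it assumes a one-hidden-layer MLP, gradient-times-activation attribution, and random inputs standing in for the yes/no bias-knowledge probes; the neuron-selection size `k` and enhancement factor `alpha` are made-up hyperparameters.

```python
# Toy sketch: attribution-based neuron selection + inference-time enhancement.
# All architecture and hyperparameter choices here are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-hidden-layer MLP: x -> h = relu(W1 @ x) -> y = w2 . h
d_in, d_hid = 8, 16
W1 = rng.normal(size=(d_hid, d_in))
w2 = rng.normal(size=d_hid)

def forward(x, scale=None):
    h = np.maximum(W1 @ x, 0.0)      # hidden-neuron activations
    if scale is not None:            # inference-time enhancement step
        h = h * scale
    return h, float(w2 @ h)

def attribution(x):
    """Gradient-times-activation score per hidden neuron."""
    h, _ = forward(x)
    return h * w2                    # dy/dh = w2; inactive neurons give 0

# A handful of probe inputs stand in for the yes/no bias-knowledge questions.
probes = rng.normal(size=(5, d_in))
scores = np.mean([np.abs(attribution(x)) for x in probes], axis=0)

# Select the top-k most attributed neurons and amplify them at inference.
k, alpha = 3, 2.0
top = np.argsort(scores)[-k:]
scale = np.ones(d_hid)
scale[top] = alpha

_, y_base = forward(probes[0])          # unmodified output
_, y_enhanced = forward(probes[0], scale)  # output with enhanced neurons
```

In a real LLM the same idea would scale activations of selected feed-forward neurons (e.g. via forward hooks), leaving all weights untouched, which is why no retraining is needed.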