Existing debiasing methods inevitably make unreasonable or undesired predictions because they are designed and evaluated to achieve parity across different social groups while neglecting individual facts, resulting in the modification of existing knowledge. In this paper, we first establish a new bias mitigation benchmark, BiasKE, which leverages existing and newly constructed datasets to systematically assess debiasing performance through complementary metrics on fairness, specificity, and generalization. Meanwhile, we propose a novel debiasing method, Fairness Stamp (FAST), which enables editable fairness through fine-grained calibration of individual biased knowledge. Comprehensive experiments demonstrate that FAST surpasses state-of-the-art baselines with remarkable debiasing performance while preserving overall model capability and existing knowledge, highlighting the prospect of fine-grained debiasing strategies for editable fairness in LLMs.