AI systems frequently exhibit and amplify social biases, leading to harmful consequences in critical areas. This study introduces a novel encoder-decoder approach that leverages model gradients to learn a feature neuron encoding societal bias information such as gender, race, and religion. We show that our method not only identifies which model weights must be changed to modify a feature, but also that these targeted edits can rewrite models to debias them while preserving their other capabilities. We demonstrate the effectiveness of our approach across a range of model architectures and highlight its potential for broader applications.
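The core idea of compressing per-example model gradients through a single-neuron bottleneck can be illustrated with a toy sketch. The code below is a hypothetical NumPy illustration, not the paper's implementation: it fabricates synthetic "gradient" vectors that share a hidden one-dimensional bias direction, then trains a linear encoder-decoder with a one-neuron bottleneck by plain gradient descent. The variable names (`W_e`, `W_d`, `direction`) and the synthetic data are assumptions made for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for per-example model gradients: each row is one
# example's flattened gradient, built from a hidden 1-D "bias direction"
# (the signal the feature neuron should capture) plus small noise.
d = 32                                     # gradient dimensionality
n = 200                                    # number of examples
direction = rng.normal(size=d)
direction /= np.linalg.norm(direction)
scores = rng.normal(size=(n, 1))           # latent bias strength per example
grads = scores @ direction[None, :] + 0.05 * rng.normal(size=(n, d))

# Linear encoder-decoder with a 1-neuron bottleneck:
#   z = G @ W_e  (feature-neuron activation),  G_hat = z @ W_d.
W_e = rng.normal(size=(d, 1)) * 0.1
W_d = rng.normal(size=(1, d)) * 0.1
lr = 0.5


def loss(W_e, W_d):
    recon = grads @ W_e @ W_d
    return float(np.mean((recon - grads) ** 2))


initial = loss(W_e, W_d)
for _ in range(2000):
    z = grads @ W_e                        # (n, 1) bottleneck activations
    recon = z @ W_d                        # (n, d) reconstructed gradients
    err = 2.0 * (recon - grads) / grads.size
    W_d -= lr * (z.T @ err)                # dL/dW_d
    W_e -= lr * (grads.T @ (err @ W_d.T))  # dL/dW_e
final = loss(W_e, W_d)

# If the sketch works, the learned encoder aligns with the hidden
# bias direction (up to sign), i.e. |cosine similarity| is high.
cos = abs(float(W_e[:, 0] @ direction)) / np.linalg.norm(W_e[:, 0])
print(f"loss {initial:.4f} -> {final:.4f}, |cos(encoder, direction)| = {cos:.2f}")
```

In this toy setting the one-neuron bottleneck recovers the planted bias direction, which mirrors the abstract's claim at a small scale: once such a direction is isolated, the decoder weights indicate which coordinates of the gradient (and hence which model weights) carry the bias signal.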