Fast Model Debias with Machine Unlearning

Recent studies have shown that deep neural networks can behave in a biased manner in many real-world scenarios. For instance, deep networks trained on the large-scale face dataset CelebA tend to predict blonde hair for females and black hair for males. Such biases not only jeopardize the robustness of models but also perpetuate and amplify social biases. This is especially concerning for automated decision-making in healthcare, recruitment, and similar domains, where biased models could exacerbate economic and social inequalities among different groups. Existing debiasing methods incur high costs for bias labeling or model re-training, and they do little to explain where the biases in a model originate. To this end, we propose a fast model debiasing framework (FMD), which offers an efficient approach to identify, evaluate, and remove biases inherent in trained models. FMD identifies biased attributes through an explicit counterfactual concept and quantifies the influence of data samples with influence functions. Moreover, we design a machine unlearning-based strategy that efficiently and effectively removes the bias in a trained model using a small counterfactual dataset. Experiments on the Colored MNIST, CelebA, and Adult Income datasets, along with experiments on large language models, demonstrate that our method achieves superior or competitive accuracy compared with state-of-the-art methods while exhibiting significantly less bias and requiring much lower debiasing cost. Notably, our method requires only a small external dataset and updates only a minimal number of model parameters, without needing access to the training data, which may be too large or unavailable in practice.
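To make the idea concrete, the following is a minimal sketch of counterfactual bias measurement followed by an influence-function-style unlearning update. It is not the paper's exact algorithm. It assumes a logistic-regression model, a squared prediction gap as the bias measure, a damped Hessian estimated on the small external set rather than on the training data, and a backtracking step size added here as a safeguard. All function names, the toy data, and the hyperparameters are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def counterfactual_bias(theta, X, X_cf):
    """Mean squared gap between predictions on factual and counterfactual inputs."""
    gap = sigmoid(X @ theta) - sigmoid(X_cf @ theta)
    return float(np.mean(gap ** 2))

def bias_gradient(theta, X, X_cf):
    """Exact gradient of counterfactual_bias with respect to theta."""
    p, p_cf = sigmoid(X @ theta), sigmoid(X_cf @ theta)
    gap = p - p_cf
    # d sigmoid(x . theta) / d theta = p (1 - p) x
    jac = (p * (1 - p))[:, None] * X - (p_cf * (1 - p_cf))[:, None] * X_cf
    return 2.0 * (gap[:, None] * jac).mean(axis=0)

def loss_hessian(theta, X, damping=1e-2):
    """Damped Hessian of the logistic loss (assumption: estimated on external data)."""
    p = sigmoid(X @ theta)
    w = p * (1 - p)
    H = (X * w[:, None]).T @ X / len(X)
    return H + damping * np.eye(len(theta))

def unlearn_step(theta, X, X_cf, damping=1e-2, max_halvings=20):
    """One Newton-style update: theta - step * H^{-1} grad(bias).

    The step is halved until the bias decreases (a safeguard of this sketch).
    """
    H = loss_hessian(theta, np.vstack([X, X_cf]), damping)
    direction = np.linalg.solve(H, bias_gradient(theta, X, X_cf))
    base = counterfactual_bias(theta, X, X_cf)
    step = 1.0
    for _ in range(max_halvings):
        candidate = theta - step * direction
        if counterfactual_bias(candidate, X, X_cf) < base:
            return candidate
        step *= 0.5
    return theta  # no improving step found; leave the parameters unchanged

# Toy external set: columns are [feature, sensitive attribute, constant 1].
rng = np.random.default_rng(0)
n = 200
feature = rng.normal(size=n)
sensitive = rng.integers(0, 2, size=n).astype(float)
X = np.column_stack([feature, sensitive, np.ones(n)])
X_cf = X.copy()
X_cf[:, 1] = 1.0 - X_cf[:, 1]  # counterfactual: flip the sensitive attribute

theta = np.array([1.0, 2.0, -0.5])  # stand-in for weights of a biased trained model
theta_new = unlearn_step(theta, X, X_cf)
print("bias before:", counterfactual_bias(theta, X, X_cf))
print("bias after: ", counterfactual_bias(theta_new, X, X_cf))
```

The update touches only the parameter vector and uses only the small factual and counterfactual set, which mirrors the abstract's claim that no access to the original training data is needed. For a deep network, a comparable step would presumably be restricted to a small subset of parameters, since forming a full Hessian is impractical at that scale.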
