We propose a novel approach to mitigating biases in computer vision models through counterfactual generation and fine-tuning. While counterfactuals have been used to analyze and address biases in DNN models, the counterfactuals themselves are often produced by biased generative models, which can introduce additional biases or spurious correlations. To address this issue, we propose using adversarial images, that is, images that deceive a deep neural network but not humans, as counterfactuals for fair model training. Our approach combines a curriculum learning framework with a fine-grained adversarial loss to fine-tune the model on adversarial examples. By incorporating adversarial images into the training data, we aim to prevent biases from propagating through the pipeline. We validate our approach through both qualitative and quantitative assessments, demonstrating improved bias mitigation and accuracy compared to existing methods. Qualitatively, our results indicate that after training, the model's decisions depend less on the sensitive attribute, and the model better disentangles the relationship between sensitive attributes and classification variables.
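To make the core idea concrete, the sketch below shows one common way adversarial images can be generated: a single fast-gradient-sign (FGSM) step that perturbs an input just enough to flip a classifier's decision while staying visually close to the original. The abstract does not specify which attack is used, so FGSM, the toy logistic model, and all names here (`fgsm_counterfactual`, the weight vector `w`, the budget `eps`) are illustrative assumptions, not the paper's actual method.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fgsm_counterfactual(x, y, w, b, eps=0.5):
    """One FGSM step (illustrative): move x in the direction that
    increases the model's loss, producing a perturbed input that can
    flip the decision while staying within an L-inf ball of radius eps."""
    p = sigmoid(w @ x + b)
    grad_x = (p - y) * w               # d(BCE loss)/dx for a logistic model
    return x + eps * np.sign(grad_x)

# Toy linear "model" and an input it classifies as positive.
w = np.ones(16)                        # fixed weights for a deterministic demo
b = 0.0
x = w / np.linalg.norm(w)              # aligned with w -> confidently positive
y = 1.0

x_adv = fgsm_counterfactual(x, y, w, b, eps=0.5)
print(sigmoid(w @ x + b) > 0.5, sigmoid(w @ x_adv + b) > 0.5)  # → True False
```

Such perturbed inputs would then serve as the counterfactual training examples: the label stays fixed by construction (a human still sees the same image), so no generative model, and hence none of its biases, enters the pipeline.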