We address the problem of concept removal in deep neural networks, aiming to learn representations that do not encode certain specified concepts (e.g., gender). We propose a novel method based on adversarial linear classifiers trained on a concept dataset, which helps to remove the targeted attribute while maintaining model performance. Our approach, Deep Concept Removal, incorporates adversarial probing classifiers at various layers of the network, effectively addressing concept entanglement and improving out-of-distribution generalization. We also introduce an implicit gradient-based technique to tackle the challenges associated with adversarial training using linear classifiers. We evaluate the ability to remove a concept on a set of popular distributionally robust optimization (DRO) benchmarks with spurious correlations, as well as on out-of-distribution (OOD) generalization tasks.
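To make concrete what a linear probing classifier measures, the following is a minimal sketch, not the method proposed here. It fits a least-squares linear probe on synthetic features and then erases the concept with a single fixed linear projection, rather than with adversarial training or the implicit gradient technique. The synthetic data, the function names (`fit_linear_probe`, `concept_direction`, `project_out`), and the choice of the feature-label cross-covariance as the direction to remove are all assumptions made for illustration.

```python
import numpy as np

def fit_linear_probe(features, concept_labels):
    """Least-squares linear probe predicting a +/-1 concept label (with bias)."""
    X = np.hstack([features, np.ones((features.shape[0], 1))])  # append bias column
    return np.linalg.lstsq(X, concept_labels, rcond=None)[0]

def probe_accuracy(features, concept_labels, weights):
    """Fraction of samples whose concept label the probe predicts correctly."""
    X = np.hstack([features, np.ones((features.shape[0], 1))])
    return float(np.mean(np.sign(X @ weights) == concept_labels))

def concept_direction(features, concept_labels):
    """Cross-covariance between features and labels (illustrative choice)."""
    centered = features - features.mean(axis=0)
    return centered.T @ (concept_labels - concept_labels.mean())

def project_out(features, direction):
    """Remove the component of every feature vector along `direction`."""
    unit = direction / np.linalg.norm(direction)
    return features - np.outer(features @ unit, unit)

# Synthetic stand-in for one layer's representations (an assumption):
# the binary concept is linearly encoded along the first coordinate.
rng = np.random.default_rng(0)
n_samples, n_dims = 400, 8
concept = rng.choice([-1.0, 1.0], size=n_samples)
features = rng.normal(size=(n_samples, n_dims))
features[:, 0] += 2.0 * concept

# Probe accuracy on the original features.
acc_before = probe_accuracy(features, concept, fit_linear_probe(features, concept))

# Probe accuracy after removing the concept direction and refitting the probe.
cleaned = project_out(features, concept_direction(features, concept))
acc_after = probe_accuracy(cleaned, concept, fit_linear_probe(cleaned, concept))

# Residual cross-covariance between cleaned features and the concept.
residual = concept_direction(cleaned, concept) / n_samples

print(f"probe accuracy before: {acc_before:.3f}, after: {acc_after:.3f}")
```

In this toy setting the projection drives the cross-covariance between the cleaned features and the concept label to zero, so a refitted least-squares probe can do no better than predicting the majority class. In a deep network, by contrast, the representation is itself learned and a concept may be entangled across layers, which is the setting the adversarial probing classifiers above are meant for.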