Ensuring that a neural network does not rely on protected attributes (e.g., race, sex, age) for prediction is crucial to advancing fair and trustworthy AI. While several promising methods for removing attribute bias in neural networks have been proposed, their limitations remain under-explored. In this work, we mathematically and empirically reveal the limitation of existing attribute bias removal methods in the presence of strong bias and propose a new method that can mitigate this limitation. Specifically, we first derive a general non-vacuous information-theoretic upper bound on the performance of any attribute bias removal method in terms of the bias strength, revealing that such methods are effective only when the inherent bias in the dataset is relatively weak. Next, we derive a necessary condition for the existence of any method that can remove attribute bias regardless of the bias strength. Inspired by this condition, we then propose a new method that uses an adversarial objective to filter out protected attributes directly in the input space while maximally preserving all other attributes, without requiring any specific target label. The proposed method achieves state-of-the-art performance in both strong and moderate bias settings. We provide extensive experiments on synthetic, image, and census datasets to verify the derived theoretical bound and its consequences in practice, and to evaluate the effectiveness of the proposed method in removing strong attribute bias.