Ensuring that a neural network does not rely on protected attributes (e.g., race, sex, age) for prediction is crucial to advancing fair and trustworthy AI. While several promising methods for removing attribute bias in neural networks have been proposed, their limitations remain under-explored. In this work, we mathematically and empirically reveal the limitations of existing attribute bias removal methods in the presence of strong bias, and we propose a new method that mitigates them. Specifically, we first derive a general, non-vacuous information-theoretic upper bound on the performance of any attribute bias removal method in terms of the bias strength, revealing that such methods are effective only when the inherent bias in the dataset is relatively weak. Next, we derive a necessary condition for the existence of any method that can remove attribute bias regardless of the bias strength. Inspired by this condition, we then propose a new method that uses an adversarial objective to filter out protected attributes directly in the input space while maximally preserving all other attributes, without requiring any specific target label. The proposed method achieves state-of-the-art performance in both strong and moderate bias settings. We provide extensive experiments on synthetic, image, and census datasets to verify the derived theoretical bound and its practical consequences, and to evaluate the effectiveness of the proposed method in removing strong attribute bias.
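
To make the notion of "bias strength" more concrete, the following minimal sketch measures how strongly a protected attribute A predicts a target label Y on toy data. It assumes, purely for illustration, that bias strength can be proxied by the empirical conditional entropy H(Y | A): a value near zero means the attribute almost determines the label (strong bias), while a value near H(Y) means the two are nearly independent (weak bias). The function name `conditional_entropy` and the toy datasets are hypothetical and not taken from the paper, and this proxy is not claimed to be the quantity used in the derived bound.

```python
import math
from collections import Counter


def conditional_entropy(labels, attrs):
    """Empirical conditional entropy H(Y | A) in bits from paired samples.

    Illustrative proxy only: lower values mean the protected attribute
    is more predictive of the label, i.e. stronger bias.
    """
    if len(labels) != len(attrs):
        raise ValueError("labels and attrs must have the same length")
    n = len(labels)
    joint = Counter(zip(attrs, labels))  # counts of (a, y) pairs
    marginal = Counter(attrs)            # counts of a
    h = 0.0
    for (a, _y), count in joint.items():
        p_ay = count / n                   # empirical p(a, y)
        p_y_given_a = count / marginal[a]  # empirical p(y | a)
        h -= p_ay * math.log2(p_y_given_a)
    return h


# Toy "strong bias" data: the label equals the protected attribute.
strong_attrs = [0] * 50 + [1] * 50
strong_labels = [0] * 50 + [1] * 50

# Toy "no bias" data: label and attribute are empirically independent.
weak_attrs = [0, 1] * 50
weak_labels = [0, 0, 1, 1] * 25

h_strong = conditional_entropy(strong_labels, strong_attrs)
h_weak = conditional_entropy(weak_labels, weak_attrs)
print(f"H(Y|A), strong bias: {h_strong:.3f} bits")
print(f"H(Y|A), no bias:     {h_weak:.3f} bits")
```

On real data the two regimes are rarely this clean, so an estimate of this kind is best read as a rough diagnostic of which regime a dataset falls into rather than as an exact measurement.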