Fair machine learning seeks to mitigate model prediction bias against certain demographic subgroups, such as older adults and women. Recently, fair representation learning (FRL) with deep neural networks has demonstrated superior performance: representations containing no demographic information are inferred from the data and then used as input to classification or other downstream tasks. Despite the development of FRL methods, their vulnerability to data poisoning attacks, a popular protocol for benchmarking model robustness in adversarial scenarios, remains under-explored. Data poisoning attacks have been developed for classical fair machine learning methods, which incorporate fairness constraints into shallow-model classifiers. Nonetheless, these attacks fall short against FRL because of its notably different fairness goals and model architectures. This work proposes the first data poisoning framework for attacking FRL. By injecting carefully crafted poisoning samples into the training data, we induce the model to output unfair representations that contain as much demographic information as possible. The attack entails a prohibitive bilevel optimization, for which we propose an effective approximate solution. We derive a theoretical analysis of the number of poisoning samples needed, which sheds light on defending against the attack. Experiments on benchmark fairness datasets and state-of-the-art fair representation learning models demonstrate the superiority of our attack.