Current AI regulations require discarding sensitive features (e.g., gender, race, religion) from an algorithm's decision-making process to prevent unfair outcomes. However, even when sensitive features are absent from the training set, algorithms can still discriminate. Indeed, when sensitive features are omitted (fairness under unawareness), they can be inferred through non-linear relations with so-called proxy features. In this work, we propose a way to reveal the hidden bias of a machine learning model that can persist even when sensitive features are discarded. We show that counterfactual reasoning can be used to determine whether a black-box predictor is still biased. In detail, when the predictor returns a negative classification outcome for an individual in a discriminated user category, our approach first builds counterfactual examples, i.e., modified versions of that individual that obtain a positive outcome. The same counterfactual samples are then fed to an external classifier, trained to predict a sensitive feature, which reveals whether the modifications needed for a positive outcome moved the individual into the non-discriminated group. When this occurs, it may be a warning sign of discriminatory behavior in the decision process. Furthermore, we use the deviation of the counterfactuals from the original sample to determine which features are proxies for specific sensitive information. Our experiments show that, even when the model is trained without sensitive features, it often still exhibits discriminatory bias.
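The two-step audit described above can be illustrated with a minimal, self-contained Python sketch. It is not the authors' implementation: the toy `predict` rule, the `sensitive_clf` threshold, the sample values, and the brute-force `find_counterfactual` search are all hypothetical stand-ins chosen for readability. In practice one would presumably use a trained black-box model, a learned sensitive-feature classifier, and a dedicated counterfactual generator.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Vector = Tuple[float, ...]

def predict(x: Vector) -> int:
    """Hypothetical black-box predictor, trained without the sensitive feature."""
    return 1 if x[0] + 4 * x[1] >= 20 else 0

def sensitive_clf(x: Vector) -> int:
    """Hypothetical external classifier: 1 = non-discriminated group, 0 = discriminated."""
    return 1 if x[1] >= 4 else 0

def find_counterfactual(x: Vector, model: Callable[[Vector], int],
                        step: float = 0.5, max_steps: int = 40) -> Optional[Vector]:
    """Brute-force stand-in for a counterfactual generator: change one feature
    at a time by growing amounts and return the first change that yields a
    positive outcome."""
    for k in range(1, max_steps + 1):
        for i in range(len(x)):
            for sign in (1, -1):
                cand = list(x)
                cand[i] += sign * k * step
                if model(tuple(cand)) == 1:
                    return tuple(cand)
    return None

def audit(samples: Sequence[Vector]) -> Tuple[float, List[float]]:
    """Return (share of counterfactuals that cross into the non-discriminated
    group, mean absolute change per feature)."""
    flipped, found = 0, 0
    deltas = [0.0] * len(samples[0])
    for x in samples:
        # Only audit negatively classified members of the discriminated group.
        if predict(x) == 1 or sensitive_clf(x) == 1:
            continue
        cf = find_counterfactual(x, predict)
        if cf is None:
            continue
        found += 1
        flipped += sensitive_clf(cf)
        for i, (a, b) in enumerate(zip(x, cf)):
            deltas[i] += abs(b - a)
    if found == 0:
        return 0.0, deltas
    return flipped / found, [d / found for d in deltas]

# Toy individuals: (neutral feature, candidate proxy feature).
samples = [(5.0, 2.0), (10.0, 2.0), (2.0, 3.0)]
flip_rate, mean_delta = audit(samples)
print(flip_rate, mean_delta)
```

On this toy data, two of the three counterfactuals are assigned to the non-discriminated group, and all of the change is concentrated in the second feature. Under the abstract's reasoning, that pattern would flag the second feature as a likely proxy. The thresholds and data here are illustrative only and carry no empirical weight.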