Exposing the Illusion of Fairness: Auditing Vulnerabilities to Distributional Manipulation Attacks

The rapid deployment of AI systems in high-stakes domains, including those classified as high-risk under the EU AI Act (Regulation (EU) 2024/1689), has intensified the need for reliable compliance auditing. For binary classifiers, regulatory risk assessment often relies on global fairness metrics such as the Disparate Impact ratio, which is widely used to evaluate potential discrimination. In a typical auditing setting, the auditee provides a subset of its dataset to an auditor, while a supervisory authority may verify whether this subset is representative of the full underlying distribution. In this work, we investigate the extent to which a malicious auditee can construct a fairness-compliant yet representative-looking sample from a non-compliant original distribution, thereby creating an illusion of fairness. We formalize this problem as a constrained distributional projection task and introduce mathematically grounded manipulation strategies based on entropic and optimal transport projections. These constructions characterize the minimal distributional shift required to satisfy fairness constraints. To counter such attacks, we formalize representativeness through statistical tests based on distributional distances and systematically evaluate their ability to detect manipulated samples. Our analysis highlights the conditions under which fairness manipulation can remain statistically undetected and provides practical guidelines for strengthening supervisory verification. We validate our theoretical findings through experiments on standard tabular datasets for bias detection. Code is publicly available at https://github.com/ValentinLafargue/Inspection.
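
To make the Disparate Impact ratio concrete, the following minimal Python sketch computes it on a toy sample and shows how selectively dropping records can raise it. The function name `disparate_impact`, the encoding of the sensitive attribute (0 for the unprivileged group, 1 for the privileged group), and the toy data are all assumptions introduced for illustration. The naive record-dropping step is only a stand-in for the idea of a manipulated subsample; it is not the entropic or optimal transport projection studied in this work.

```python
import numpy as np

def disparate_impact(y_pred, s):
    """Ratio of positive-prediction rates between groups:
    P(Yhat = 1 | S = 0) / P(Yhat = 1 | S = 1).
    Assumes s == 0 marks the unprivileged group (illustrative convention)."""
    y_pred = np.asarray(y_pred)
    s = np.asarray(s)
    rate_unpriv = y_pred[s == 0].mean()
    rate_priv = y_pred[s == 1].mean()
    return rate_unpriv / rate_priv

# Hypothetical toy data: binary predictions and a binary sensitive attribute.
y_pred = np.array([1, 0, 0, 0, 1, 1, 1, 0])
s = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# Full sample: positive rate 0.25 for S=0 and 0.75 for S=1.
di_full = disparate_impact(y_pred, s)

# Naive manipulation: drop two negatively predicted records from the S=0 group.
keep = np.array([0, 3, 4, 5, 6, 7])
di_sub = disparate_impact(y_pred[keep], s[keep])

print(f"DI on full sample: {di_full:.3f}")
print(f"DI on manipulated subsample: {di_sub:.3f}")
```

On this toy data the ratio rises from about 0.33 to about 0.67 after dropping only two records. A common convention (the "four-fifths rule") treats a ratio below 0.8 as a sign of potential disparate impact. Such crude dropping also visibly changes the group proportions, which is the kind of shift a representativeness test is meant to detect.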
