Fair data pre-processing is a widely used strategy for mitigating bias in machine learning. A promising line of research focuses on calibrating datasets to satisfy a designed fairness policy so that sensitive attributes influence outcomes only through clearly specified legitimate causal pathways. While effective on clean and information-rich data, these methods often break down in real-world scenarios with imperfect attribute spaces, where decision-relevant factors may be deemed unusable or may be missing altogether. To address this gap, we propose LatentPre, a novel framework that enables principled and robust fair data processing in practical settings. Instead of relying solely on observed attributes, LatentPre augments the fairness policy with latent attributes that capture essential but subtle signals, enabling the framework to operate as if the attribute space were perfect. These latent attributes are strategically introduced to guarantee identifiability and are estimated using a tailored expectation-maximization paradigm. The raw data is then carefully refined to conform to this latent-augmented policy, effectively removing biased patterns while preserving justifiable ones. Extensive experiments demonstrate that LatentPre consistently achieves strong fairness-utility trade-offs across diverse scenarios, advancing practical fairness-aware data management.