Enforcing conditional independence (CI) constraints is pivotal for developing fair and trustworthy machine learning models. In this paper, we introduce \sys, a framework that uses optimal transport theory to repair data so that it satisfies CI constraints. Optimal transport provides a rigorous way to measure the discrepancy between probability distributions, which gives control over how much data utility is lost during repair. We formulate the data repair problem under CI constraints as a Quadratically Constrained Linear Program (QCLP) and propose an alternating method to solve it. This approach scales poorly, however, because computing optimal transport distances such as the Wasserstein distance is expensive. To overcome this, we reformulate the problem as a regularized optimization problem, which yields an iterative algorithm inspired by Sinkhorn's matrix scaling algorithm that handles high-dimensional and large-scale data efficiently. Through extensive experiments, we demonstrate the efficacy and efficiency of the proposed methods and their practical utility in real-world data cleaning and preprocessing tasks. Comparisons with traditional approaches further show that our techniques better preserve data utility while satisfying the desired CI constraints.
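For readers unfamiliar with the matrix scaling step mentioned above, the following is a minimal sketch of the classical Sinkhorn iteration for entropy-regularized optimal transport between two discrete distributions. It is not the CI-constrained algorithm of \sys; it only illustrates the alternating row and column rescaling that the method is said to be inspired by. The function name `sinkhorn`, its parameters (`reg`, `n_iters`, `tol`), and the toy inputs are assumptions made for illustration.

```python
import math


def sinkhorn(a, b, cost, reg=0.1, n_iters=500, tol=1e-9):
    """Illustrative entropy-regularized OT plan between histograms a and b.

    a, b : lists of positive weights, each summing to 1 (assumed inputs)
    cost : nested list, cost[i][j] = cost of moving mass from i to j
    reg  : entropic regularization strength (hypothetical default)
    """
    n, m = len(a), len(b)
    # Gibbs kernel: K[i][j] = exp(-cost[i][j] / reg)
    K = [[math.exp(-cost[i][j] / reg) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Rescale rows so the plan's row sums match a.
        u_new = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        # Rescale columns so the plan's column sums match b.
        v = [b[j] / sum(K[i][j] * u_new[i] for i in range(n)) for j in range(m)]
        converged = max(abs(x - y) for x, y in zip(u, u_new)) < tol
        u = u_new
        if converged:
            break
    # Transport plan: diag(u) * K * diag(v)
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]


# Toy example: move mass between two 2-point distributions.
a = [0.5, 0.5]
b = [0.25, 0.75]
cost = [[0.0, 1.0], [1.0, 0.0]]
plan = sinkhorn(a, b, cost, reg=0.5)
```

Each iteration costs only matrix-vector products with the kernel, which is the usual reason such schemes scale better than solving the unregularized transport problem exactly. Smaller values of `reg` bring the plan closer to the unregularized optimum but can slow convergence and cause numerical underflow in the kernel.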