Datasets may contain errors, in particular violations of integrity constraints, for various reasons. Standard techniques for ``minimal-cost'' database repairing resolve these violations by aiming for minimum change in the data, and in the process may skew the representation of different sub-populations. For instance, the repair may end up deleting more tuples for females than for males, or more tuples from a certain age group or race, due to varying levels of inconsistency in different sub-populations. Such repaired data can mislead consumers when used for analytics, and can lead to biased decisions in downstream machine learning tasks. We study the ``cost of representation'' in subset repairs for functional dependencies. In simple terms, we ask how many additional tuples have to be deleted if we want to satisfy not only the integrity constraints but also representation constraints for given sub-populations. We study the complexity of this problem and compare it with the complexity of computing optimal subset repairs without representation constraints. While the problem is NP-hard in general, we give polynomial-time algorithms for special cases and efficient heuristics for the general case. We perform a suite of experiments that show the effectiveness of our algorithms in computing or approximating the cost of representation.
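To make the notion of ``cost of representation'' concrete, the following is a minimal brute-force sketch on a toy relation. It is not one of the algorithms from the paper: it enumerates subsets exhaustively and is only feasible for a handful of tuples. The abstract does not fix the exact form of a representation constraint, so the sketch assumes a hypothetical one in which each listed sub-population must make up at least a given share of the kept tuples. The table, the attribute names, and the helper functions `satisfies_fd`, `satisfies_representation`, and `min_deletions` are all illustrative assumptions.

```python
from itertools import combinations


def satisfies_fd(rows, lhs, rhs):
    """True if the functional dependency lhs -> rhs holds on rows."""
    seen = {}
    for r in rows:
        key = tuple(r[a] for a in lhs)
        val = tuple(r[a] for a in rhs)
        # Two rows agreeing on lhs must also agree on rhs.
        if seen.setdefault(key, val) != val:
            return False
    return True


def satisfies_representation(rows, attr, min_share):
    """Assumed constraint form: each value v in min_share accounts for at
    least min_share[v] of the kept rows (vacuously true for no rows)."""
    n = len(rows)
    return all(
        sum(r[attr] == v for r in rows) >= share * n
        for v, share in min_share.items()
    )


def min_deletions(table, lhs, rhs, attr=None, min_share=None):
    """Smallest number of deletions giving a consistent subset that also
    meets the (optional) representation constraint. Exponential time."""
    n = len(table)
    for k in range(n + 1):  # try 0 deletions, then 1, ...
        for kept in combinations(table, n - k):
            if not satisfies_fd(kept, lhs, rhs):
                continue
            if attr is not None and not satisfies_representation(kept, attr, min_share):
                continue
            return k
    return n  # not reached: the empty subset always qualifies


# Toy relation with the functional dependency zip -> city (hypothetical data).
table = [
    {"gender": "F", "zip": 1, "city": "A"},
    {"gender": "F", "zip": 1, "city": "B"},  # conflicts with both zip=1 rows
    {"gender": "M", "zip": 1, "city": "A"},
    {"gender": "M", "zip": 2, "city": "C"},
]

base = min_deletions(table, ["zip"], ["city"])
fair = min_deletions(table, ["zip"], ["city"], "gender", {"F": 0.5, "M": 0.5})
print(base, fair, fair - base)  # 1 2 1
```

On this toy input, the optimal subset repair deletes one tuple but leaves females at one third of the data; requiring at least half of the kept tuples for each gender forces a second deletion, so the cost of representation under the assumed constraint is one extra tuple.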