Subset repair is an important data cleaning technique that enforces integrity constraints by deleting a minimal number of conflicting tuples, yet multiple minimal repairs often exist. Density-based methods address this ambiguity by favoring repairs that preserve dense, high-quality data regions; however, their effectiveness is limited by density bias from dirty clusters, high computational cost, and uniform attribute weighting. We propose a topology-aware approximate subset repair framework based on a joint density-conflict penalty model. The framework integrates three key components. First, a two-layer conflict detection strategy combines attribute inverted indexes with conditional functional dependency (CFD) rule grouping to identify violations efficiently. Second, we introduce EntroCFDensity, a density metric that incorporates information entropy and CFD weights to adjust attribute importance dynamically and reduce homogeneity bias. Third, we define a conflict degree measure that complements local density, enabling a topology-adaptive penalty mechanism whose weights are allocated dynamically according to the coefficient of variation. The conflict graph is further decomposed into independent subgraphs, transforming global repair into tractable local subproblems. Based on this framework, we develop two algorithms: PPIS, a scalable heuristic, and MICO, a mixed-integer programming method with theoretical guarantees. Experimental results show that our approach improves repair accuracy and robustness while effectively preserving high-quality data.
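As a rough illustration of the decomposition idea described above, the following Python sketch builds a conflict graph, splits it into connected components, and repairs each component independently. It is not PPIS or MICO. It assumes a plain functional dependency (`zip` determines `city`) rather than a CFD, it ranks tuples by conflict degree alone with no density term, and all function names and sample rows are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def find_conflicts(rows, lhs, rhs):
    """Return pairs of row indices that agree on `lhs` but differ on `rhs`."""
    groups = defaultdict(list)
    for i, row in enumerate(rows):
        groups[tuple(row[a] for a in lhs)].append(i)
    edges = set()
    for members in groups.values():
        for i, j in combinations(members, 2):
            if rows[i][rhs] != rows[j][rhs]:
                edges.add((i, j))
    return edges


def components(edges):
    """Split the conflict graph into connected components (iterative DFS)."""
    adj = defaultdict(set)
    for i, j in edges:
        adj[i].add(j)
        adj[j].add(i)
    seen, comps = set(), []
    for start in sorted(adj):
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:
            v = stack.pop()
            if v in comp:
                continue
            comp.add(v)
            stack.extend(adj[v] - comp)
        seen |= comp
        comps.append(comp)
    return comps, adj


def greedy_repair(edges):
    """Per component, delete the highest-degree tuple until no conflict remains."""
    comps, adj = components(edges)
    deleted = set()
    for comp in comps:
        # Work on a private copy of the adjacency restricted to this component.
        live = {v: adj[v] & comp for v in comp}
        while any(live.values()):
            # Ties are broken by smallest index (an arbitrary choice).
            victim = max(sorted(live), key=lambda v: len(live[v]))
            for u in live.pop(victim):
                live[u].discard(victim)
            deleted.add(victim)
    return deleted


# Hypothetical sample data: rows 0-2 share one zip code, rows 3-4 another.
rows = [
    {"zip": "10001", "city": "NYC"},
    {"zip": "10001", "city": "NYC"},
    {"zip": "10001", "city": "Boston"},
    {"zip": "60601", "city": "Chicago"},
    {"zip": "60601", "city": "Chicgo"},
]
edges = find_conflicts(rows, lhs=["zip"], rhs="city")
comps, _ = components(edges)
deleted = greedy_repair(edges)
```

In this sample, the first component has a clear outlier (row 2 conflicts with two tuples), while the second is a symmetric two-tuple conflict in which degree alone cannot tell the clean tuple from the dirty one. That second case is the kind of ambiguity a density-based score is meant to resolve.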