Cluster repair methods aim to detect errors in clusters and modify them so that each cluster consists only of records representing the same entity. Current cluster repair methods primarily assume duplicate-free data sources, where each record from one source corresponds to a unique record from another. However, real-world data often violates this assumption due to quality issues. Recent approaches combine clustering methods with link categorization methods so that they can be applied to data sources with duplicates. Nevertheless, the results are inconclusive, since quality varies considerably with the configuration and dataset. In this study, we introduce a novel approach for cluster repair that uses graph metrics derived from the underlying similarity graphs. These metrics are used to build a classification model that distinguishes between correct and incorrect edges. To address the challenge of limited training data, we integrate an active learning mechanism tailored to cluster-specific attributes. The evaluation shows that the method outperforms existing cluster repair methods without distinguishing between duplicate-free and dirty data sources. Notably, our modified active learning strategy performs better on datasets containing duplicates, demonstrating its effectiveness in such scenarios.
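To make the idea of edge-level graph metrics concrete, the following is a minimal sketch, assuming the similarity graph is given as a dictionary from record pairs to similarity scores and that clusters are its connected components. The function names (`connected_components`, `edge_features`) and the particular features (`common_neighbors`, `min_degree_ratio`, `cluster_size`) are illustrative assumptions and are not taken from the approach described above; they only indicate the kind of input a classifier for correct and incorrect edges might receive.

```python
from collections import defaultdict


def connected_components(nodes, adj):
    """Return the connected components of an undirected graph as sets of nodes."""
    seen, comps = set(), []
    for start in nodes:
        if start in seen:
            continue
        stack, comp = [start], set()
        seen.add(start)
        while stack:
            u = stack.pop()
            comp.add(u)
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        comps.append(comp)
    return comps


def edge_features(edges):
    """Compute illustrative per-edge features (hypothetical feature set).

    edges: dict mapping (u, v) -> similarity score in [0, 1].
    """
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)

    # Map every record to the cluster (connected component) it belongs to.
    cluster_of = {}
    for comp in connected_components(sorted(adj), adj):
        for node in comp:
            cluster_of[node] = comp

    feats = {}
    for (u, v), sim in edges.items():
        size = len(cluster_of[u])  # at least 2, since u and v are linked
        feats[(u, v)] = {
            "similarity": sim,
            # Shared neighbors: how many triangles support this edge.
            "common_neighbors": len(adj[u] & adj[v]),
            # Smaller endpoint degree relative to the maximum possible degree.
            "min_degree_ratio": min(len(adj[u]), len(adj[v])) / (size - 1),
            "cluster_size": size,
        }
    return feats


# Toy cluster: a, b, c are densely linked; d hangs on by one weak edge.
edges = {
    ("a", "b"): 0.90,
    ("b", "c"): 0.85,
    ("a", "c"): 0.80,
    ("c", "d"): 0.55,
}
features = edge_features(edges)
for edge, f in features.items():
    print(edge, f)
```

In this toy cluster the edge `("c", "d")` has no common neighbors and a low degree ratio, which is the kind of structural signal a learned model could use to flag an edge as a candidate for removal. How such features are actually weighted, and which edges are sent to an oracle for labeling during active learning, is left open here.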