A considerable body of work on data cleaning employs various principles to rectify erroneous data and transform a dirty dataset into a cleaner one. Probabilistic methods, including Bayesian methods, are among the prevalent approaches. However, existing probabilistic methods often assume a simplistic distribution (e.g., a Gaussian distribution), which frequently underfits in practice, or they require experts to provide a complex prior distribution (e.g., via a programming language). This requirement is both labor-intensive and costly, rendering these methods less suitable for real-world applications. In this paper, we propose BClean, a Bayesian Cleaning system that features automatic Bayesian network construction and user interaction. We recast the data cleaning problem as Bayesian inference that fully exploits the relationships between attributes in the observed dataset and any prior information provided by users. To this end, we present an automatic Bayesian network construction method that extends a structure learning-based functional dependency discovery method with similarity functions to capture the relationships between attributes. Furthermore, our system allows users to modify the generated Bayesian network in order to specify prior information or correct inaccuracies in the automatically generated network. We also design an effective scoring model (called the compensative scoring model) necessary for the Bayesian inference. To enhance the efficiency of data cleaning, we propose several approximation strategies for the Bayesian inference, including graph partitioning, domain pruning, and pre-detection. By evaluating on both real-world and synthetic datasets, we demonstrate that BClean is capable of achieving an F-measure of up to 0.9 in data cleaning, outperforming existing Bayesian methods by 2% and other data cleaning methods by 15%.
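To make the idea of data cleaning as Bayesian inference concrete, the following is a minimal sketch, not BClean's actual method. It assumes a hypothetical two-attribute network in which `zip` is the parent of `city`, estimates the conditional distribution from the observed (dirty) rows with Laplace smoothing, and repairs each `city` cell by picking the most probable value given its `zip`. The function names, the toy rows, and the smoothing parameter `alpha` are all illustrative assumptions; the compensative scoring model and the approximation strategies named in the abstract are not reproduced here.

```python
from collections import Counter, defaultdict


def fit_conditional(rows, parent, child):
    """Count co-occurrences to estimate P(child | parent) from the observed rows."""
    counts = defaultdict(Counter)
    for row in rows:
        counts[row[parent]][row[child]] += 1
    return counts


def repair_cell(row, parent, child, counts, domain, alpha=1.0):
    """Return the child value with the highest smoothed P(child | parent)."""
    parent_counts = counts[row[parent]]
    total = sum(parent_counts.values())

    def smoothed_probability(value):
        # Laplace smoothing: every candidate in the domain gets alpha pseudo-counts.
        return (parent_counts[value] + alpha) / (total + alpha * len(domain))

    return max(domain, key=smoothed_probability)


# Toy dirty dataset (hypothetical): the third row contains a typo in "city".
rows = [
    {"zip": "10001", "city": "New York"},
    {"zip": "10001", "city": "New York"},
    {"zip": "10001", "city": "New Yrok"},
    {"zip": "60601", "city": "Chicago"},
    {"zip": "60601", "city": "Chicago"},
]

counts = fit_conditional(rows, "zip", "city")
domain = sorted({r["city"] for r in rows})  # candidate repair values
cleaned = [dict(r, city=repair_cell(r, "zip", "city", counts, domain)) for r in rows]
```

In this toy setting the misspelled cell is replaced by the value that co-occurs most often with the same `zip`, while the already consistent rows are left unchanged. A real system would need a learned network over many attributes and a scoring model that avoids overwriting rare but correct values, which is the role the abstract assigns to the compensative scoring model.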