Although data cleaning systems have achieved great success and widespread adoption in both academia and industry, they fall short when cleaning spatial data. The main reason is that state-of-the-art data cleaning systems rely mainly on functional dependency rules, which require sufficient co-occurrence of value pairs to learn that a certain value of one attribute implies a corresponding value of another attribute. However, for spatial attributes that represent locations in the form of <latitude, longitude>, there is very little chance that two records have exactly the same coordinates, and hence such co-occurrence is unlikely to exist. This paper presents Sparcle (SPatially-AwaRe CLEaning), a novel framework that injects spatial awareness into the core engine of rule-based data cleaning systems as a means of boosting their accuracy. Sparcle injects two main spatial concepts into the core engine of data cleaning systems: (1) Spatial Neighborhood, where co-occurrence is relaxed to hold within a certain spatial proximity rather than requiring the exact same value, and (2) Distance Weighting, where records are weighted differently, based on their relative distance, when determining whether they satisfy a dependency rule. Experimental results, using a real deployment of Sparcle inside a state-of-the-art data cleaning system and both real and synthetic datasets, show that Sparcle significantly boosts the accuracy of data cleaning systems when dealing with spatial data.
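To make the two spatial concepts in the abstract concrete, the following is a minimal illustrative sketch and not Sparcle's actual implementation. It assumes records are `((lat, lon), value)` pairs, uses the haversine formula for distance, and uses a linear decay for the weights; the function names `haversine_km`, `weighted_votes`, and `suggest_value`, the `radius_km` parameter, and the decay function are all hypothetical choices made for illustration.

```python
import math
from collections import defaultdict

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, used by the haversine formula


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) pairs in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    dlat, dlon = lat2 - lat1, lon2 - lon1
    h = (math.sin(dlat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def weighted_votes(target, records, radius_km=1.0):
    """Score candidate values for a location using nearby records.

    Spatial neighborhood (assumed form): only records within `radius_km`
    of `target` count as co-occurring with it.
    Distance weighting (assumed form): each such record contributes a
    weight that decays linearly from 1 at distance 0 to 0 at `radius_km`.
    """
    votes = defaultdict(float)
    for loc, value in records:
        d = haversine_km(target, loc)
        if d <= radius_km:
            votes[value] += 1.0 - d / radius_km
    return dict(votes)


def suggest_value(target, records, radius_km=1.0):
    """Return the highest-scoring neighboring value, or None if no neighbors."""
    votes = weighted_votes(target, records, radius_km)
    return max(votes, key=votes.get) if votes else None
```

With exact-match co-occurrence, no two records in a typical dataset would share coordinates, so a dependency from location to, say, a zip code could never be learned. In this sketch, nearby records still contribute evidence, and closer records contribute more of it.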