Data integration is a classical problem in databases, typically decomposed into schema matching, entity matching, and data fusion. Work on data fusion mostly assumes that ground truth can be determined. In general, however, the data-gathering processes of the different sources are imperfect, so values cannot be merged accurately. In the absence of ways to determine ground truth, it is important to at least quantify how far a dataset is from being internally consistent. We therefore propose definitions of concordant data and define a discordance metric that measures disagreement, with the aim of improving decision making based on trustworthiness. We define the discord measurement problem for numerical attributes: given a set of uncertain raw observations or aggregate results (such as case, hospitalization, and death data relevant to COVID-19) and information on the alignment of different conceptualizations of the same reality (e.g., granularities or units), we wish to assess whether the different sources are concordant and, if they are not, use the discordance metric to quantify how discordant they are. We also define a set of algebraic operators to describe the alignments of different data sources with correctness guarantees, together with two alternative relational database implementations that reduce the problem to linear or quadratic programming. We evaluate these implementations on both COVID-19 and synthetic data; our experimental results show that discordance measurement can be performed efficiently in realistic situations.
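To make the idea of measuring discord under an alignment concrete, the following is a minimal sketch and not the definition or implementation used in this work. It assumes a single hypothetical alignment (a weekly total should equal the sum of seven daily counts) and uses, as an illustrative stand-in for a discordance metric, the smallest sum of squared adjustments that would make the observations satisfy that alignment. For one linear constraint this tiny quadratic program has a closed form, so no solver is needed. The function name `least_squares_discord` and all data values are invented for illustration.

```python
def least_squares_discord(values, coeffs):
    """Smallest sum of squared adjustments to `values` such that
    sum(c * v) == 0 holds afterwards (illustrative proxy only).

    For a single linear constraint a . x = 0, the minimal squared
    adjustment is (a . x)^2 / (a . a).
    """
    residual = sum(c * v for c, v in zip(coeffs, values))
    norm_sq = sum(c * c for c in coeffs)
    return residual ** 2 / norm_sq


# Hypothetical data: source A reports daily counts, source B a weekly total.
daily = [10, 12, 9, 11, 13, 8, 7]
weekly_total = 74

# Assumed alignment between the two granularities:
# sum(daily) - weekly_total == 0
values = daily + [weekly_total]
coeffs = [1] * len(daily) + [-1]

discord = least_squares_discord(values, coeffs)
print(discord)
```

A value of zero means the two sources agree under the assumed alignment; larger values mean a larger adjustment would be needed to reconcile them.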