Data integration is a notoriously difficult, heuristic-driven process, especially when ground-truth data are not readily available. This paper presents a measure of uncertainty for two-table, one-to-many data integration workflows: the minimal and maximal values that a query outcome can take over the possible matchings. Users can use these query ranges to guide a search over different matching parameters, similarity metrics, and constraints. Although there are exponentially many such matchings, we show that, under appropriately constrained circumstances, this result range can be calculated in polynomial time with bipartite graph matching. We evaluate this approach on real-world and synthetic datasets and find that uncertainty estimates are more robust when a graph-matching-based approach is used for data integration.
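As a rough illustration of how such a range might be obtained without enumerating matchings, the following minimal sketch makes several assumptions that are not stated in the abstract: the query is a SUM that decomposes into per-pair contributions, every row of the "many" table must be matched to exactly one candidate row of the "one" table, and each row of the "one" table has a fixed capacity. Under these assumptions the problem reduces to a min-cost bipartite assignment, solved here with a textbook Hungarian algorithm. The names `query_range`, `min_cost_assignment`, `contrib`, and `capacity` are hypothetical and are not taken from the paper.

```python
BIG = 10**9  # assumed cost for non-candidate pairs; real contributions must be far smaller


def min_cost_assignment(cost):
    """Textbook Hungarian algorithm; needs rows <= columns, assigns every row."""
    n, m = len(cost), len(cost[0])
    u, v = [0] * (n + 1), [0] * (m + 1)      # dual potentials
    p, way = [0] * (m + 1), [0] * (m + 1)    # p[j]: row assigned to column j (1-based)
    for i in range(1, n + 1):
        p[0], j0 = i, 0
        minv = [float("inf")] * (m + 1)
        used = [False] * (m + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = p[j0], float("inf"), 0
            for j in range(1, m + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(m + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:
                break
        while j0:                             # flip the augmenting path
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    return sum(cost[p[j] - 1][j - 1] for j in range(1, m + 1) if p[j])


def query_range(contrib, capacity):
    """Return (min, max) of a SUM query over all valid one-to-many matchings.

    contrib[child][parent] is what the pair adds to the query if matched;
    a missing parent means the pair is not a candidate match.
    """
    # Duplicate each parent once per unit of capacity to get a one-to-one problem.
    slots = [parent for parent, cap in capacity.items() for _ in range(cap)]
    if len(contrib) > len(slots):
        raise ValueError("not enough capacity to match every child row")

    def solve(sign):
        cost = [[sign * row[s] if s in row else BIG for s in slots]
                for row in contrib.values()]
        total = min_cost_assignment(cost)
        if total >= BIG // 2:                 # a non-candidate pair was forced
            raise ValueError("no matching satisfies the constraints")
        return sign * total

    return solve(+1), solve(-1)


# Toy query: SUM(amount) over child rows whose matched parent is in region "EU".
amount = {"c1": 10, "c2": 20, "c3": 5}
region = {"p_eu": "EU", "p_us": "US"}
candidates = {"c1": ["p_eu", "p_us"], "c2": ["p_eu", "p_us"], "c3": ["p_eu"]}
capacity = {"p_eu": 2, "p_us": 1}
contrib = {c: {p: amount[c] if region[p] == "EU" else 0 for p in ps}
           for c, ps in candidates.items()}

low, high = query_range(contrib, capacity)
print(low, high)  # 15 25
```

In this toy instance `c3` can only match `p_eu`, and the capacities force exactly one of `c1` and `c2` into the EU parent, so the query result lies between 15 and 25 regardless of which matching is chosen. A wide range of this kind would suggest that the matching parameters leave the query outcome underdetermined.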