Datasets often contain values that naturally reside in a metric space: numbers, strings, geographical locations, machine-learned embeddings in a Euclidean space, and so on. We study the computational complexity of repairing inconsistent databases that violate integrity constraints, where the database values belong to an underlying metric space. The goal is to update the database values so as to restore consistency while minimizing the total distance between the original values and the repaired ones. We consider what we refer to as \emph{coincidence constraints}, which include key constraints, inclusion constraints, foreign keys, and, more generally, any restriction on the relationship between the numbers of cells of different labels (attributes) that coincide in a single value, for a fixed attribute set. We begin by showing that the problem is APX-hard for general metric spaces. We then present an algorithm that solves the problem optimally for tree metrics, which generalize both the line metric (i.e., where repaired values are numbers) and the discrete metric (i.e., where we simply count the number of changed values). Combining our algorithm for tree metrics with a classic result on probabilistic tree embeddings, we design a (high-probability) logarithmic-ratio approximation for general metrics. We also study the variant of the problem in which the allowed change of each individual value is bounded. In this variant, it is already NP-complete to decide whether any legal repair exists for a general metric, and we present a polynomial-time repairing algorithm for the case of a line metric.
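As a rough formalization of the objective described above (the symbols are illustrative assumptions, not notation fixed by the text): let $C$ be the set of database cells, let $v_c$ be the original value of cell $c$ in a metric space $(M, d)$, and let $v'_c$ be its repaired value. A repair would then be chosen to solve
\[
\min_{v'} \; \sum_{c \in C} d(v_c, v'_c)
\quad \text{subject to the repaired database satisfying the coincidence constraints.}
\]
Under the line metric, $d(x, y) = |x - y|$; under the discrete metric, $d(x, y) = 1$ if $x \neq y$ and $d(x, y) = 0$ otherwise, so the sum simply counts the changed cells. In the bounded variant, one would additionally require $d(v_c, v'_c) \le \delta_c$ for every cell $c$, where $\delta_c$ is a hypothetical per-cell bound introduced here only for illustration.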