Order Dependencies (ODs) have many applications, such as query optimization, data integration, and data cleaning. Although many works have addressed the problem of discovering ODs (and their variants), they do not consider datasets with missing values, which are common in real-world data. This paper introduces the novel notion of Embedded ODs (eODs) to deal with missing values. The intuition behind eODs is to check an OD only on the tuples that have no missing values on a given embedding (a set of attributes). In this paper, we address the problem of validating a given eOD. If the eOD holds, we return true. Otherwise, we search for an updated embedding such that the updated eOD holds. If no such embedding exists, we return false. A natural requirement is to choose an embedding that minimizes the number of ignored tuples. We show that computing such an embedding is NP-complete. We therefore propose an efficient heuristic algorithm for validating eODs. We conduct experiments on real-world datasets, and the results confirm the efficiency of our algorithm.
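To make the intuition concrete, the following is a minimal sketch of checking an eOD for a fixed embedding. It is not the heuristic algorithm proposed in the paper. It assumes a common lexicographic OD semantics (for all tuples s and t, s[X] <= t[X] implies s[Y] <= t[Y]), represents missing values as `None`, and requires the OD's attributes to be contained in the embedding. The names `complete_rows`, `od_holds`, `eod_holds`, and the sample attributes are hypothetical and chosen only for illustration.

```python
from itertools import combinations


def complete_rows(rows, embedding):
    """Keep only the rows with no missing value (None) on any embedding attribute."""
    return [r for r in rows if all(r[a] is not None for a in embedding)]


def od_holds(rows, lhs, rhs):
    """Assumed OD semantics: for all s, t, s[lhs] <= t[lhs] implies s[rhs] <= t[rhs].

    Attribute lists are compared lexicographically via Python tuple ordering.
    This is a quadratic pairwise check, kept simple on purpose.
    """
    def key(row, attrs):
        return tuple(row[a] for a in attrs)

    for s, t in combinations(rows, 2):
        # Check the implication in both directions for each unordered pair.
        for a, b in ((s, t), (t, s)):
            if key(a, lhs) <= key(b, lhs) and not key(a, rhs) <= key(b, rhs):
                return False
    return True


def eod_holds(rows, embedding, lhs, rhs):
    """Check the OD lhs -> rhs only on rows that are complete on the embedding."""
    # Assumption of this sketch: the OD's attributes lie inside the embedding,
    # so no None value is ever compared.
    if not (set(lhs) | set(rhs)) <= set(embedding):
        raise ValueError("lhs and rhs must be contained in the embedding")
    return od_holds(complete_rows(rows, embedding), lhs, rhs)


# Hypothetical sample relation; None marks a missing value.
rows = [
    {"year": 2019, "salary": 50, "bonus": 5},
    {"year": 2020, "salary": 60, "bonus": None},
    {"year": 2021, "salary": None, "bonus": 9},
    {"year": 2022, "salary": 70, "bonus": 7},
]

print(eod_holds(rows, ["year", "salary"], ["year"], ["salary"]))          # True
print(eod_holds(rows, ["year", "bonus"], ["year"], ["bonus"]))            # False
print(eod_holds(rows, ["year", "salary", "bonus"], ["year"], ["bonus"]))  # True
```

In this toy relation, the OD from `year` to `bonus` fails on the embedding {`year`, `bonus`} but holds once the embedding is enlarged to include `salary`, at the cost of ignoring two tuples instead of one. This is the trade-off behind the requirement of minimizing the number of ignored tuples.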