Order Dependencies (ODs) have many applications, including query optimization, data integration, and data cleaning. Although many works have addressed the problem of discovering ODs (and their variants), they do not consider datasets with missing values, which are common in real-world data. This paper introduces the novel notion of Embedded ODs (eODs) to deal with missing values. The intuition behind eODs is to check an OD only on the tuples that have no missing values on a given embedding (a set of attributes). In this paper, we address the problem of validating a given eOD. If the eOD holds, we return true. Otherwise, we search for an updated embedding such that the updated eOD holds. If no such embedding exists, we return false. A natural requirement is to choose an embedding that minimizes the number of ignored tuples. We show that computing such an embedding is NP-complete. We therefore propose an efficient heuristic algorithm for validating eODs. We conduct experiments on real-world datasets, and the results confirm the efficiency of our algorithm.
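To make the intuition concrete, the following is a minimal Python sketch of checking an eOD for one fixed embedding. It is not the heuristic algorithm proposed in the paper, and it does not search for an updated embedding. It rests on several assumptions that the abstract does not state: missing values are represented as `None`, the OD is list-based (ordering tuples lexicographically on the left-hand side must also order them on the right-hand side, with ties on the left implying ties on the right), and the embedding contains every attribute of the OD. The names `complete_rows`, `od_holds`, and `eod_holds`, as well as the sample attributes, are hypothetical.

```python
from typing import Any, Dict, Iterable, List, Sequence

Row = Dict[str, Any]  # attribute name -> value; None marks a missing value (assumption)


def complete_rows(rows: Iterable[Row], embedding: Iterable[str]) -> List[Row]:
    """Keep only the tuples that have no missing value on the embedding."""
    attrs = list(embedding)
    return [r for r in rows if all(r.get(a) is not None for a in attrs)]


def od_holds(rows: Iterable[Row], lhs: Sequence[str], rhs: Sequence[str]) -> bool:
    """Check a list-based OD lhs -> rhs on rows that are complete on lhs and rhs."""
    # Project each row to (lhs values, rhs values) and sort lexicographically.
    projected = sorted(
        (tuple(r[a] for a in lhs), tuple(r[a] for a in rhs)) for r in rows
    )
    for (x_prev, y_prev), (x_cur, y_cur) in zip(projected, projected[1:]):
        if x_prev == x_cur and y_prev != y_cur:
            return False  # "split": equal on lhs but different on rhs
        if y_prev > y_cur:
            return False  # "swap": lhs increases while rhs decreases
    return True


def eod_holds(rows: Iterable[Row], lhs: Sequence[str], rhs: Sequence[str],
              embedding: Iterable[str]) -> bool:
    """Check the OD only on tuples with no missing value on the embedding."""
    emb = set(embedding)
    if not (set(lhs) | set(rhs)) <= emb:
        # Assumption of this sketch: the embedding covers the OD's attributes.
        raise ValueError("embedding must contain all attributes of the OD")
    return od_holds(complete_rows(rows, emb), lhs, rhs)


# Hypothetical sample data.
rows = [
    {"salary": 10, "tax": 1, "bonus": 5},
    {"salary": 20, "tax": 2, "bonus": None},
    {"salary": 30, "tax": None, "bonus": 7},
    {"salary": 40, "tax": 1, "bonus": None},
    {"salary": 50, "tax": 3, "bonus": 9},
]

print(eod_holds(rows, ["salary"], ["tax"], {"salary", "tax"}))           # False
print(eod_holds(rows, ["salary"], ["tax"], {"salary", "tax", "bonus"}))  # True
```

In this toy example, the embedding {salary, tax} ignores one tuple and the OD fails because of a swap between salaries 20 and 40. Enlarging the embedding with bonus makes the eOD hold, but at the cost of ignoring three tuples. This is the trade-off behind the requirement to minimize the number of ignored tuples.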