Graph entity dependencies (GEDs) are novel graph constraints for property graphs that unify keys and functional dependencies. They have proved useful in many real-world data quality and data management tasks, including entity resolution and fact checking on social media networks. In this paper, we study the discovery problem for GEDs: finding a minimal cover of the valid GEDs in a given graph. We formalise the problem and propose an effective and efficient approach that overcomes major bottlenecks in GED discovery. In particular, we leverage existing graph partitioning algorithms to enable fast GED-scope discovery, and employ effective pruning strategies over the prohibitively large space of candidate dependencies. Furthermore, we define an interestingness measure for GEDs, based on the minimum description length principle, to score and rank the mined cover set of GEDs. Finally, we demonstrate the scalability and effectiveness of our GED discovery approach through extensive experiments on real-world benchmark graph data sets, and show the usefulness of the discovered rules in different downstream data quality management applications.