Discovering valuable insights from data through meaningful associations is a crucial task. However, identifying representative patterns in quantitative databases is challenging, especially for large datasets, because enumeration-based strategies struggle with the vast search space. To tackle this challenge, output space sampling methods have emerged as a promising solution thanks to their ability to discover valuable patterns with reduced computational overhead. However, existing sampling methods often run into scalability limitations on large quantitative databases. In this work, we propose a novel high utility pattern sampling algorithm and its on-disk version, both designed for large quantitative databases and based on two original theorems. Our approach ensures both the interactivity required for user-centered methods and strong statistical guarantees through random sampling. With our method, users can discover relevant and representative utility patterns within seconds, facilitating efficient exploration of the database. To demonstrate the value of our approach, we present a compelling use case involving the discovery of sub-profiles in an archaeological knowledge graph. Experiments on semantic and non-semantic quantitative databases show that our approach outperforms state-of-the-art methods.