Avoiding redundancy in query results has been extensively studied in relational databases and information retrieval, yet its implications for data lakes remain largely unexplored. We bridge this gap by investigating how to discover unionable tables that contribute new information to a given query table in large-scale data lakes. We formally define Novel Table Search (NTS) as the problem of finding tables that are novel with respect to a given query table, and we identify two desirable properties that any scoring function for NTS should satisfy. We introduce a concrete scoring mechanism designed to maximize syntactic novelty, prove that it satisfies the proposed properties, and show that the associated optimization problem is NP-hard. To address this hardness, we develop an efficient penalization-based approximation technique, Attribute-Based Novel Table Search (ANTs). We also propose three additional NTS variants that target syntactic novelty and introduce two metrics for evaluating it. Through extensive experiments, we demonstrate that ANTs outperforms the other methods in capturing syntactic novelty across evaluation metrics and benchmarks, while also achieving the lowest execution time.
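The abstract does not spell out the scoring function or the penalization scheme, so the following is only a toy sketch of the general idea of novelty-aware table selection, not the ANTs algorithm itself. It assumes that tables are already aligned to the query schema and represented as sets of row tuples, and that novelty is measured as the fraction of a candidate's rows not yet seen. The names `novelty` and `greedy_novel_tables` and the sample data are hypothetical.

```python
from typing import AbstractSet, Dict, FrozenSet, List, Tuple

Row = Tuple[str, ...]
Table = FrozenSet[Row]


def novelty(candidate: AbstractSet[Row], seen: AbstractSet[Row]) -> float:
    """Toy measure (an assumption): share of candidate rows absent from `seen`."""
    if not candidate:
        return 0.0
    return len(candidate - seen) / len(candidate)


def greedy_novel_tables(query: Table, lake: Dict[str, Table], k: int) -> List[str]:
    """Greedily pick up to k tables, discounting rows that are already covered.

    Rows from the query and from previously chosen tables count as covered,
    so a near-copy of an earlier pick is penalized in later rounds.
    """
    covered = set(query)
    remaining = dict(lake)
    chosen: List[str] = []
    while remaining and len(chosen) < k:
        # Iterate names in sorted order so ties resolve deterministically.
        best = max(sorted(remaining), key=lambda name: novelty(remaining[name], covered))
        if novelty(remaining[best], covered) == 0.0:
            break  # nothing left contributes new rows
        chosen.append(best)
        covered |= remaining.pop(best)
    return chosen


# Hypothetical sample data: one query table and four candidate tables.
query: Table = frozenset({("a", "1"), ("b", "2")})
lake: Dict[str, Table] = {
    "t_dup": frozenset({("a", "1"), ("b", "2")}),
    "t_half": frozenset({("a", "1"), ("c", "3")}),
    "t_new": frozenset({("d", "4"), ("e", "5")}),
    "t_new_copy": frozenset({("d", "4"), ("e", "5")}),
}

print(greedy_novel_tables(query, lake, k=3))  # ['t_new', 't_half']
```

In this sample, `t_new_copy` is skipped even though it is fully novel with respect to the query alone, because its rows are already covered once `t_new` has been chosen. A greedy heuristic of this kind gives no optimality guarantee, which is in line with the NP-hardness result stated above.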