Advanced Dataset Discovery: When Multi-Query-Dataset Cardinality Estimation Matters

As the volume of available data grows, so does the demand for dataset discovery. Existing studies often return coarse-grained results that contain substantial information overlap and irrelevant data. They also implicitly assume that a user can purchase all datasets found, which is rarely true in practice. It is therefore desirable to produce dataset discovery results with less redundancy, driven by fine-grained information needs and constrained by a budget. To this end, we study the problem of finding a set of datasets that maximizes distinctiveness with respect to a user's fine-grained information needs and a base dataset, while keeping the total price of the datasets within a budget. The user's fine-grained information needs are expressed as a query set, and the distinctiveness of a set of datasets is the number of distinct tuples that the query set produces on those datasets and that do not overlap with the base dataset. First, we prove that this problem is NP-hard. Then, we develop a greedy algorithm that achieves an approximation ratio of (1-e^{-1})/2. However, this algorithm is neither efficient nor scalable, because it repeatedly computes the exact distinctiveness during dataset selection, which requires testing every tuple in the query results for overlap across multiple datasets. To address this, we propose an efficient and effective machine-learning-based (ML-based) algorithm that estimates the distinctiveness of a set of datasets without testing every tuple. The proposed algorithm is the first to support cardinality estimation (CE) for a query set on multiple datasets; previous studies support CE only for a single query on a single dataset and cannot effectively identify query result overlaps across multiple datasets. Extensive experiments on five real-world data pools demonstrate that our greedy algorithm with ML-based distinctiveness estimation outperforms all other baselines in both effectiveness and efficiency.
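
To make the problem setting concrete, the following is a minimal Python sketch of exact distinctiveness and a budgeted greedy selection. It is not the paper's implementation: the abstract does not spell out the algorithm, so the sketch uses the standard cost-benefit greedy for budgeted coverage (pick by marginal gain per unit price, then compare against the best single affordable dataset), which is one well-known way to obtain a guarantee of this form. All names (`query_results`, `distinctiveness`, `budgeted_greedy`), the modelling of a query as a row predicate, and the assumption of strictly positive prices are hypothetical choices made for illustration.

```python
from typing import Callable, Dict, Hashable, Iterable, List, Set, Tuple

Row = Tuple[Hashable, ...]
Query = Callable[[Row], bool]  # assumption: a query is a simple row predicate


def query_results(rows: Iterable[Row], queries: List[Query]) -> Set[Row]:
    """Distinct rows returned by at least one query in the query set."""
    return {row for row in rows if any(q(row) for q in queries)}


def distinctiveness(selected: Iterable[str], pool: Dict[str, List[Row]],
                    base: List[Row], queries: List[Query]) -> int:
    """Exact distinctiveness: distinct query-result tuples over the selected
    datasets that do not also appear in the base dataset's query results."""
    base_hits = query_results(base, queries)
    hits: Set[Row] = set()
    for name in selected:
        hits |= query_results(pool[name], queries)
    return len(hits - base_hits)


def budgeted_greedy(pool: Dict[str, List[Row]], prices: Dict[str, float],
                    base: List[Row], queries: List[Query],
                    budget: float) -> List[str]:
    """Cost-benefit greedy (assumed variant, prices must be > 0)."""
    def value(names: List[str]) -> int:
        return distinctiveness(names, pool, base, queries)

    chosen: List[str] = []
    spent, current = 0.0, 0
    remaining = set(pool)
    while remaining:
        best, best_ratio = None, 0.0
        for name in sorted(remaining):  # sorted for deterministic tie-breaking
            if spent + prices[name] > budget:
                continue  # cannot afford this dataset
            gain = value(chosen + [name]) - current
            ratio = gain / prices[name]
            if ratio > best_ratio:
                best, best_ratio = name, ratio
        if best is None:  # nothing affordable adds new tuples
            break
        chosen.append(best)
        spent += prices[best]
        current = value(chosen)
        remaining.discard(best)

    # Guard against one expensive but very distinctive dataset being missed.
    affordable = [n for n in sorted(pool) if prices[n] <= budget]
    best_single = max(affordable, key=lambda n: value([n]), default=None)
    if best_single is not None and value([best_single]) > current:
        return [best_single]
    return chosen
```

Note that every call to `distinctiveness` rescans all tuples of the candidate datasets, and the greedy loop makes such a call for every remaining candidate in every round. This repeated exact computation is the cost that the abstract's ML-based estimator is meant to avoid.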
