We study the problem of discovering joinable datasets at scale. We approach the problem from a learning perspective that relies on profiles. Profiles are succinct representations that capture the underlying characteristics of the schemata and data values of datasets, and they can be efficiently extracted in a distributed and parallel fashion. Profiles are then compared to predict the quality of a join operation between a pair of attributes from different datasets. In contrast to the state of the art, we define a novel notion of join quality that relies on a metric considering both the containment and the cardinality proportion between join candidate attributes. We implement our approach in a system called NextiaJD and present experiments that evaluate the predictive performance and computational efficiency of our method. Our experiments show that NextiaJD achieves higher predictive performance than hash-based methods while scaling up to larger volumes of data.
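To make the two quantities behind the join quality notion concrete, the following minimal sketch computes them exactly for a pair of small attributes. It is an illustration under stated assumptions, not the NextiaJD implementation: the function names are hypothetical, containment is assumed to be the fraction of distinct values of one attribute found in the other, and cardinality proportion is assumed to be the ratio of their distinct-value counts. The abstract does not state how the two are combined into a quality label, so no combination is attempted here.

```python
def containment(a, b):
    """Fraction of distinct values of `a` that also occur in `b`.

    Assumed definition: |A ∩ B| / |A| over distinct values.
    """
    distinct_a, distinct_b = set(a), set(b)
    if not distinct_a:
        return 0.0  # convention chosen for this sketch: empty attribute
    return len(distinct_a & distinct_b) / len(distinct_a)


def cardinality_proportion(a, b):
    """Ratio of distinct-value counts of `a` and `b`.

    Assumed definition: |A| / |B| over distinct values.
    """
    distinct_a, distinct_b = set(a), set(b)
    if not distinct_b:
        return 0.0  # convention chosen for this sketch: empty attribute
    return len(distinct_a) / len(distinct_b)


# Hypothetical attributes from two different datasets.
country_codes = ["ES", "FR", "DE", "IT"]
all_codes = ["ES", "FR", "DE", "IT", "PT", "NL", "BE", "PL"]

c = containment(country_codes, all_codes)
k = cardinality_proportion(country_codes, all_codes)
print(f"containment={c}, cardinality proportion={k}")
```

In this example every value of `country_codes` appears in `all_codes`, so containment alone would rate the pair as a perfect candidate, while the cardinality proportion shows that the first attribute has only half as many distinct values as the second. Looking at both quantities together is what separates this notion of join quality from containment-only measures. The learning approach described above predicts such quality from profiles rather than computing these exact set operations on the full data.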