How can we discover join relationships among columns of tabular data in a data repository? Can this be done effectively when metadata is missing? Traditional column matching approaches rely mainly on similarity measures based on exact value overlaps, and therefore miss important semantics or fail to handle noise in the data. At the same time, recent dataset discovery methods that focus on deep table representation learning do not take into account the rich set of column similarity signals used in prior matching and discovery methods. Finally, existing methods depend heavily on user-provided similarity thresholds, which hinders their deployment in real-world settings. In this paper, we propose OmniMatch, a novel join discovery technique that detects equi-joins and fuzzy joins between columns by combining column-pair similarity measures with Graph Neural Networks (GNNs). OmniMatch's GNN captures column relatedness by leveraging graph transitivity, significantly improving the recall of join discovery tasks. OmniMatch also increases precision by augmenting its training data with negative column-join examples produced by an automated negative example generation process. Most importantly, compared to state-of-the-art matching and discovery methods, OmniMatch achieves up to 14% higher effectiveness in F1 score and AUC without relying on metadata or on user-provided thresholds for each similarity metric.
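To illustrate why exact value overlap struggles with noisy data, the following minimal sketch contrasts an exact Jaccard score with a simple fuzzy containment score on two small columns. It is not OmniMatch's implementation: the function names, the sample columns, and the `min_ratio` cutoff are assumptions made for illustration only, and the fuzzy measure uses Python's standard `difflib` rather than any measure named in the paper.

```python
from difflib import SequenceMatcher


def jaccard(col_a, col_b):
    """Exact-overlap Jaccard similarity between the distinct values of two columns."""
    a, b = set(col_a), set(col_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def fuzzy_containment(col_a, col_b, min_ratio=0.8):
    """Fraction of distinct values in col_a with a close string match in col_b.

    min_ratio is a hypothetical, hand-picked cutoff (an assumption of this sketch).
    """
    a, b = set(col_a), set(col_b)
    if not a:
        return 0.0
    matched = sum(
        1
        for x in a
        if any(
            SequenceMatcher(None, x.lower(), y.lower()).ratio() >= min_ratio
            for y in b
        )
    )
    return matched / len(a)


# Hypothetical columns: the second is a noisy copy of the first
# (different casing, a dropped letter, a trailing space).
cities_clean = ["Amsterdam", "Rotterdam", "Utrecht"]
cities_noisy = ["amsterdam", "Rotterdm", "Utrecht "]

exact_score = jaccard(cities_clean, cities_noisy)
fuzzy_score = fuzzy_containment(cities_clean, cities_noisy)
print(exact_score, fuzzy_score)
```

On these inputs the exact measure finds no shared values, while the fuzzy measure matches every value. Note that the fuzzy variant still needs a hand-tuned cutoff (`min_ratio`), which is the kind of user-provided threshold the abstract identifies as an obstacle to deployment.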