Automatically inferring join relationships is a critical task for effective data discovery, integration, querying and reuse. However, identifying these relationships accurately and efficiently in large, complex schemas is challenging, especially in enterprise settings where access to data values is constrained. In this paper, we introduce the problem of join graph inference when only metadata is available. We conduct an empirical study on a large number of real-world schemas and observe that join graphs, when represented as adjacency matrices, exhibit two key properties: high sparsity and low-rank structure. Based on these observations, we formulate join graph inference as a low-rank matrix completion problem and propose Nexus, an end-to-end solution that uses only metadata. To further improve accuracy, we propose a novel Expectation-Maximization algorithm that alternates between low-rank matrix completion and refining join candidate probabilities with Large Language Models. Our extensive experiments demonstrate that Nexus outperforms existing methods by a significant margin on four datasets, including a real-world production dataset. Additionally, Nexus can operate in a fast mode that provides comparable results with a speedup of up to 6x, offering a practical and efficient solution for real-world deployments.
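To make the low-rank matrix completion formulation concrete, the following is a minimal sketch and not the Nexus algorithm itself. It assumes a generic iterated truncated-SVD scheme (often called hard-impute), a toy six-column join graph, and a known target rank; the function name `complete_low_rank` and all of its parameters are hypothetical and chosen only for illustration.

```python
import numpy as np


def complete_low_rank(observed, mask, rank, n_iters=200):
    """Fill unobserved entries by iterated rank-`rank` SVD truncation.

    observed: square float array; values where mask is False are ignored.
    mask: boolean array, True where the join / no-join label is known.
    This is a generic hard-impute sketch, assumed here for illustration.
    """
    estimate = np.where(mask, observed, 0.0)  # start unknown entries at 0
    for _ in range(n_iters):
        u, s, vt = np.linalg.svd(estimate, full_matrices=False)
        # Best rank-`rank` approximation of the current estimate.
        low_rank = (u[:, :rank] * s[:rank]) @ vt[:rank, :]
        # Keep known labels fixed; take the low-rank guess elsewhere.
        estimate = np.where(mask, observed, low_rank)
    return estimate


# Toy join graph over six columns: two groups of three mutually joinable
# columns (diagonal set to 1 for simplicity), so the matrix has rank 2.
truth = np.zeros((6, 6))
truth[:3, :3] = 1.0
truth[3:, 3:] = 1.0

# Pretend two join edges are unknown (hidden symmetrically).
mask = np.ones((6, 6), dtype=bool)
for i, j in [(0, 2), (3, 5)]:
    mask[i, j] = mask[j, i] = False

completed = complete_low_rank(truth, mask, rank=2)
print(np.round(completed[0, 2], 3), np.round(completed[3, 5], 3))
```

In this toy case the hidden entries are pulled toward 1 because a rank-2 matrix consistent with the observed labels must place each hidden pair inside its block. Real schemas would need a way to choose the rank and to handle noisy candidate scores, which this sketch does not attempt.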