HyperJoin: LLM-augmented Hypergraph Link Prediction for Joinable Table Discovery

Joinable table discovery is a pivotal task in data lake management and has attracted widespread interest. Existing language model-based methods achieve strong performance by combining offline column representation learning with online ranking, but their design does not sufficiently account for the underlying structural interactions: (1) offline, they model tables as isolated columns or column pairs, and therefore struggle to capture the rich inter-table and intra-table structural information; and (2) online, they rank candidate columns solely by query-candidate similarity, ignoring the mutual interactions among candidates, which leads to incoherent result sets. To address these limitations, we propose HyperJoin, a large language model (LLM)-augmented Hypergraph framework for Joinable table discovery. Specifically, we first construct a hypergraph that models tables using both intra-table hyperedges and LLM-augmented inter-table hyperedges. Joinable table discovery is then formulated as link prediction on this hypergraph. We next design HIN, a Hierarchical Interaction Network that learns expressive column representations through bidirectional message passing over columns and hyperedges. To strengthen the coherence and internal consistency of the result columns, we cast online ranking as a coherence-aware top-k column selection problem and introduce a reranking module that leverages a maximum spanning tree algorithm to prune noisy connections and maximize coherence. Experiments demonstrate the superiority of HyperJoin, which achieves average improvements of 21.4% (Precision@15) and 17.2% (Recall@15) over the best baseline.
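
To make two of the ideas above concrete, the following is a minimal Python sketch and not the HyperJoin implementation. It assumes that an intra-table hyperedge can be represented as the set of columns belonging to one table, and that pruning noisy connections among candidate columns can be approximated by keeping only the edges of a maximum spanning tree over pairwise similarity scores. The function names `build_intra_table_hyperedges` and `maximum_spanning_tree`, the toy tables, and the similarity values are all hypothetical and chosen only for illustration; the abstract does not specify how HyperJoin builds hyperedges or scores candidate pairs.

```python
def build_intra_table_hyperedges(tables):
    """Return one hyperedge per table: the set of that table's column ids.

    Assumption: a column id is written as "<table>.<column>".
    """
    return {
        name: {f"{name}.{col}" for col in cols}
        for name, cols in tables.items()
    }


def maximum_spanning_tree(nodes, weights):
    """Kruskal's algorithm, taking edges in order of descending weight.

    `weights` maps an (u, v) pair to a similarity score. Edges that would
    close a cycle are skipped, so weaker redundant links are pruned.
    """
    parent = {n: n for n in nodes}

    def find(n):
        # Union-find lookup with path halving.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    tree = []
    for (u, v), w in sorted(weights.items(), key=lambda kv: kv[1], reverse=True):
        root_u, root_v = find(u), find(v)
        if root_u != root_v:
            parent[root_u] = root_v
            tree.append((u, v, w))
    return tree


# Toy data (hypothetical): two tables and four candidate columns.
tables = {"orders": ["customer_id", "total"], "customers": ["id", "name"]}
hyperedges = build_intra_table_hyperedges(tables)

candidates = ["a", "b", "c", "d"]
similarity = {("a", "b"): 0.9, ("b", "c"): 0.8, ("a", "c"): 0.2, ("c", "d"): 0.1}
tree = maximum_spanning_tree(candidates, similarity)
```

In this toy run the weak `("a", "c")` link is dropped because `a` and `c` are already connected through `b` by stronger edges. A real reranker would still need a rule for choosing the top-k columns from the pruned structure, which this sketch leaves out.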
