Efficient and Effective Table-Centric Table Union Search in Data Lakes

In data lakes, information on the same subject is often fragmented across multiple tables. Table union search aims to find the top-k tables that can be unioned with a query table to extend it with more rows, without relying on metadata or ground-truth labels. Existing methods are mainly column-centric: they model column unionability scores using column embeddings, and these scores are then used throughout the search process for column matching, filtering, and aggregation. However, this approach overlooks holistic table-level semantics, which may result in suboptimal rankings and inefficiencies. We introduce TACTUS, a novel table-centric method for table union search. Unlike prior work that searches from columns to tables, we search in a table-first way and examine columns only in the final step. During offline processing, we directly generate table embeddings for holistic, table-level unionability scoring by designing three table-level representation techniques: positive table pair construction, which simulates unionable tables; two-pronged negative table sampling, which avoids latent positives and mines hard negatives to enhance representation quality; and attentive table encoding, which produces effective embeddings. During online search, we first develop a table-centric adaptive candidate retrieval method that efficiently selects a compact, high-quality candidate pool by leveraging the distribution of table-level unionability scores induced by table embeddings. We then inspect columns only within this compact candidate set and design a dual-evidence reranking technique that integrates table-level and column-level scores to refine the final top-k results. Extensive experiments on real-world datasets show that TACTUS significantly improves result quality while being much faster than existing methods in both offline and online processing, often by an order of magnitude.
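
To make the table-first search order concrete, the following is a minimal sketch of the online phase, not the TACTUS implementation. It assumes table and column embeddings are already available as plain vectors. Everything specific in it is a hypothetical stand-in: cosine similarity as the unionability score, a largest-score-gap rule for the adaptive cutoff, a best-match average for the column-level score, and a weight `alpha` for combining the two kinds of evidence. The abstract does not specify any of these choices.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve_candidates(query_emb, table_embs, k):
    """Table-level retrieval with an adaptive pool size.

    Assumed heuristic: rank all tables by table-level score, keep at least k,
    and cut the pool at the largest drop between consecutive scores.
    """
    scored = sorted(
        ((cosine(query_emb, emb), name) for name, emb in table_embs.items()),
        reverse=True,
    )
    if len(scored) <= k:
        return scored
    # Gaps between neighbours, starting at the k-th ranked table.
    gaps = [scored[i][0] - scored[i + 1][0] for i in range(k - 1, len(scored) - 1)]
    cut = k + gaps.index(max(gaps))
    return scored[:cut]


def column_score(query_cols, cand_cols):
    """Assumed column-level evidence: mean best-match similarity per query column."""
    if not query_cols or not cand_cols:
        return 0.0
    best = (max(cosine(q, c) for c in cand_cols) for q in query_cols)
    return sum(best) / len(query_cols)


def search(query_emb, query_cols, table_embs, table_cols, k, alpha=0.5):
    """Table-first search: retrieve by table score, then rerank with both scores."""
    candidates = retrieve_candidates(query_emb, table_embs, k)
    reranked = sorted(
        (
            (alpha * t_score + (1 - alpha) * column_score(query_cols, table_cols[name]), name)
            for t_score, name in candidates
        ),
        reverse=True,
    )
    return [name for _, name in reranked[:k]]


# Toy data: four tables with 2-dimensional embeddings (illustrative only).
table_embs = {"A": [1.0, 0.1], "B": [0.9, 0.3], "C": [0.0, 1.0], "D": [0.1, 1.0]}
table_cols = {
    "A": [[1.0, 0.0]],
    "B": [[1.0, 0.0], [0.0, 1.0]],
    "C": [[0.0, 1.0]],
    "D": [[0.0, 1.0]],
}
query_emb = [1.0, 0.0]
query_cols = [[1.0, 0.0], [0.0, 1.0]]

print(search(query_emb, query_cols, table_embs, table_cols, k=2))
```

In this toy run, the table-level scores alone rank A ahead of B, and the score gap after B limits the candidate pool to those two tables. Column embeddings are compared only for that pool, where B matches both query columns and A matches one, so the combined score places B first.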
