Table retrieval is the task of finding the tables most relevant to a natural language query in a large-scale corpus. However, structural and semantic discrepancies between unstructured text and structured tables make embedding alignment particularly challenging. Recent methods such as QGpT attempt to enrich table semantics by generating synthetic queries, yet they still rely on coarse partial-table sampling and simple fusion strategies, which limit semantic diversity and hinder effective query-table alignment. We propose STAR (Semantic Table Representation), a lightweight framework that improves semantic table representation through semantic clustering and weighted fusion. STAR first applies header-aware K-means clustering to group semantically similar rows and selects representative centroid instances to construct a diverse partial table. It then generates cluster-specific synthetic queries to cover the table's semantic space comprehensively. Finally, STAR employs weighted fusion strategies to integrate table and query embeddings, enabling fine-grained semantic alignment. This design lets STAR capture complementary information from structured and textual sources, improving the expressiveness of table representations. Experiments on five benchmarks show that STAR achieves consistently higher recall than QGpT on all datasets, demonstrating the effectiveness of semantic clustering and adaptive weighted fusion for robust table representation. Our code is available at https://github.com/adsl135789/STAR.
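The two mechanical steps named in the abstract, picking representative rows by clustering and fusing table and query embeddings with a weight, can be illustrated with a minimal sketch. This is not the released STAR implementation: the abstract does not give the fusion formula, the header-aware encoding, or the clustering settings, so the function names (`representative_rows`, `weighted_fusion`), the fixed scalar weight `alpha`, the first-k centroid seeding, and the toy two-dimensional vectors are all assumptions made for illustration. Row embeddings are assumed to come from some encoder applied to header-plus-cell text.

```python
import math

def l2_normalize(v):
    # Scale a vector to unit length; a zero vector is returned unchanged.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def sq_dist(a, b):
    # Squared Euclidean distance between two equal-length vectors.
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(vectors):
    # Component-wise mean of a non-empty list of vectors.
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def assign(points, centroids):
    # Index of the nearest centroid for every point.
    return [min(range(len(centroids)), key=lambda c: sq_dist(p, centroids[c]))
            for p in points]

def kmeans(points, k, iters=20):
    # Plain Lloyd's algorithm. Assumption: centroids are seeded with the
    # first k points so that the sketch is deterministic.
    k = min(k, len(points))
    centroids = [list(p) for p in points[:k]]
    for _ in range(iters):
        labels = assign(points, centroids)
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # keep the old centroid if a cluster empties
                centroids[c] = mean(members)
    return assign(points, centroids), centroids

def representative_rows(row_embs, k):
    # For each cluster, keep the index of the row closest to its centroid.
    labels, centroids = kmeans(row_embs, k)
    reps = []
    for c in range(len(centroids)):
        members = [i for i, l in enumerate(labels) if l == c]
        if members:
            reps.append(min(members,
                            key=lambda i: sq_dist(row_embs[i], centroids[c])))
    return sorted(reps)

def weighted_fusion(table_emb, query_embs, alpha=0.5):
    # Hypothetical fusion rule: alpha * table + (1 - alpha) * mean(queries),
    # computed on unit vectors and re-normalized.
    t = l2_normalize(table_emb)
    q = mean([l2_normalize(e) for e in query_embs])
    return l2_normalize([alpha * a + (1 - alpha) * b for a, b in zip(t, q)])

# Toy row embeddings forming two obvious groups.
rows = [[1.0, 0.0], [0.0, 1.0], [0.8, 0.2], [0.2, 0.8], [0.9, 0.1], [0.1, 0.9]]
partial_table = representative_rows(rows, k=2)   # indices of the kept rows
fused = weighted_fusion([1.0, 0.0], [[0.0, 1.0]], alpha=0.5)
```

With these toy vectors, the rows kept for the partial table are the ones nearest each cluster mean, and `alpha` controls how far the fused vector leans toward the table embedding (`alpha=1.0`) or the synthetic-query embeddings (`alpha=0.0`).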