The rapid growth of tabular datasets in data lakes, data spaces, and open data portals makes effective dataset search essential for reuse and analysis. Existing search systems rely mainly on metadata, which is often incomplete or of low quality, especially for tables whose meaning depends on both schema and cell values. Recent advances in Large Language Models (LLMs) enable richer, content-based representations of tables. However, prior LLM-based retrieval methods have focused on Table Question Answering (TableQA), where the goal is to select a single table that answers a question, rather than to retrieve and rank relevant datasets. We propose PIPER, a content-driven retrieval method for tabular datasets that embeds table profiles and LLM-generated queries for dense retrieval. Designed for dataset search in metadata-poor settings, PIPER outperforms both classical metadata-based baselines and strong TableQA retrieval methods, demonstrating the value of LLM-based content modeling for tabular dataset search.
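As a rough illustration of content-driven retrieval over table profiles, the following minimal sketch indexes each table by a short text profile plus a few synthetic queries and ranks tables by cosine similarity to the user query. It is not PIPER's implementation: the profile format, the function names (`profile_table`, `embed`, `build_index`, `search`), the example tables, and the hand-written strings standing in for LLM-generated queries are all assumptions made for illustration, and the bag-of-words `embed` function is a toy placeholder for a learned dense encoder.

```python
import math
import re
from collections import Counter


def profile_table(name, columns, rows, sample_size=3):
    """Build a short text profile: table name, column names, a few sample values."""
    parts = [f"table {name}"]
    for i, col in enumerate(columns):
        samples = [str(row[i]) for row in rows[:sample_size]]
        parts.append(f"column {col}: " + ", ".join(samples))
    return "; ".join(parts)


def embed(text):
    """Toy stand-in for a dense encoder: L2-normalised bag-of-words counts.

    A real system would call a learned sentence-embedding model here.
    """
    counts = Counter(re.findall(r"[a-z0-9]+", text.lower()))
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {token: c / norm for token, c in counts.items()}


def cosine(u, v):
    """Cosine similarity of two already-normalised sparse vectors."""
    return sum(weight * v.get(token, 0.0) for token, weight in u.items())


def build_index(tables, synthetic_queries):
    """Embed each table's profile together with its synthetic queries (an assumption)."""
    index = {}
    for name, (columns, rows) in tables.items():
        text = profile_table(name, columns, rows)
        text += " " + " ".join(synthetic_queries.get(name, []))
        index[name] = embed(text)
    return index


def search(index, query, k=5):
    """Return the top-k (table name, score) pairs, highest score first."""
    q = embed(query)
    scored = [(cosine(q, vec), name) for name, vec in index.items()]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(name, score) for score, name in scored[:k]]


# Hypothetical example data; the queries below stand in for LLM output.
TABLES = {
    "air_quality": (
        ["city", "pm25", "date"],
        [("Paris", 12.1, "2023-01-01"), ("Lyon", 9.4, "2023-01-01")],
    ),
    "bus_stops": (
        ["stop_name", "latitude", "longitude"],
        [("Central", 48.85, 2.35)],
    ),
}
SYNTHETIC_QUERIES = {
    "air_quality": ["pollution levels by city", "daily particulate measurements"],
    "bus_stops": ["public transport stop locations"],
}
INDEX = build_index(TABLES, SYNTHETIC_QUERIES)
```

Calling `search(INDEX, "city pollution measurements")` ranks `air_quality` ahead of `bus_stops`, even though neither table carries any descriptive metadata; the match comes entirely from the profile and the synthetic queries.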