PIPER: Content-Based Table Search via Profiling and LLM-Generated Pseudoqueries

The rapid growth of tabular datasets in data lakes, data spaces, and open data portals makes effective dataset search essential for reuse and analysis. Existing search systems rely mainly on metadata, which is often incomplete or of low quality, especially for tables whose meaning depends on both schema and cell values. Recent advances in Large Language Models (LLMs) enable richer, content-based representations of tables. However, prior LLM-based retrieval methods have focused on Table Question Answering (TableQA), where the goal is to select a single table that answers a question, rather than to retrieve and rank relevant datasets. We propose PIPER, a content-driven retrieval method for tabular datasets that uses table profiles and LLM-generated pseudoqueries, embedded for dense retrieval. Designed for dataset search in settings with poor metadata, PIPER outperforms both classical metadata-based baselines and strong TableQA retrieval methods, demonstrating the value of LLM-based content modeling for tabular dataset search.
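
The general idea of indexing a table by a content profile plus pseudoqueries can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the profile format, the prompts, the encoder, or the scoring rule, so every name below (`profile_table`, `embed`, `build_index`, `search`) is hypothetical. The pseudoqueries are hard-coded strings standing in for LLM output, and `embed` is a toy bag-of-words vector standing in for a real dense encoder. Scoring a table by its best-matching indexed text is likewise an assumption made for illustration.

```python
import math
import re
from collections import Counter


def profile_table(name, columns, rows, max_values=3):
    """Hypothetical profile: column names plus a few distinct sample values."""
    parts = [f"table {name}"]
    for i, col in enumerate(columns):
        seen = []
        for row in rows:
            value = str(row[i])
            if value not in seen:
                seen.append(value)
            if len(seen) == max_values:
                break
        parts.append(f"{col}: {', '.join(seen)}")
    return "; ".join(parts)


def embed(text):
    """Toy stand-in for a dense encoder: L2-normalised bag-of-words counts."""
    counts = Counter(re.findall(r"[a-z0-9]+", text.lower()))
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {tok: c / norm for tok, c in counts.items()}


def cosine(a, b):
    """Dot product of two normalised sparse vectors."""
    return sum(w * b.get(tok, 0.0) for tok, w in a.items())


def build_index(tables, pseudoqueries):
    """Index every table under its profile and each of its pseudoqueries."""
    index = []
    for name, (columns, rows) in tables.items():
        index.append((name, embed(profile_table(name, columns, rows))))
        for query_text in pseudoqueries.get(name, []):
            index.append((name, embed(query_text)))
    return index


def search(index, query, k=5):
    """Rank tables by their best-matching indexed text (assumed scoring rule)."""
    q = embed(query)
    best = {}
    for name, vec in index:
        score = cosine(q, vec)
        if score > best.get(name, -1.0):
            best[name] = score
    return sorted(best.items(), key=lambda item: item[1], reverse=True)[:k]


# Illustrative data only; the pseudoqueries stand in for LLM-generated text.
tables = {
    "t1": (["city", "pm25", "date"],
           [["Paris", 12.1, "2024-01-01"], ["Lyon", 9.4, "2024-01-01"]]),
    "t2": (["player", "team", "goals"],
           [["Ana", "Reds", 12], ["Bo", "Blues", 7]]),
}
pseudoqueries = {
    "t1": ["air quality measurements by city", "daily pm25 pollution levels"],
    "t2": ["football top scorers by team", "goals scored per player"],
}

index = build_index(tables, pseudoqueries)
results = search(index, "air pollution levels in cities")
```

Here the query shares no words with the column names of `t1`, yet the table is ranked first because one of its pseudoqueries matches. That is the gap a content-based representation is meant to close when metadata is poor. In practice `embed` would be replaced by a learned sentence encoder and the index by a vector store.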
