DPDisc: From Factoid Questions to Data Product Requests for Open-World Data Product Discovery over Tables and Text

Data products are reusable, self-contained assets designed for specific business use cases. Automating their discovery is of great industry interest, as it enables efficient data access in large data lakes and supports analytical workflows. However, no benchmark currently exists for data product discovery over hybrid table-text corpora. Existing datasets focus on answering single factoid questions over individual tables rather than on assembling multiple related data assets into coherent products. To address this gap, we present DPDisc, the first large-scale benchmark for data product discovery, in which systems must retrieve coherent collections of tables and passages to satisfy high-level Data Product Requests (DPRs). We introduce DPForge, an automated pipeline that systematically repurposes table-text QA datasets in three steps: clustering related tables and passages into coherent data products, generating professional-level analytical requests with an LLM ensemble, and validating quality through multi-phase LLM evaluation. DPDisc comprises 13,076 validated instances with full provenance, derived from three representative datasets spanning open-domain and financial data. Baseline experiments with sparse, dense, and hybrid retrieval methods demonstrate that evaluation on the benchmark is feasible while revealing substantial performance gaps across domains, indicating opportunities for future research in structure-aware data product discovery. The dataset is available at https://huggingface.co/datasets/ibm-research/data-product-benchmark and the code at https://github.com/ibm/data-product-benchmark
