Growing volumes of structured data can provide value for research and business, but only if the relevant data can be located. Often the data resides in a data lake without a consistent schema, which makes locating useful data challenging. Table search is a growing research area, but existing benchmarks have been limited to displayed tables. Tables sized and formatted for display in a Wikipedia page or arXiv paper differ considerably from data tables in both scale and style. Using metadata associated with open data from government portals, we create the first dataset for benchmarking search over data tables at scale. We demonstrate three styles of table-to-table related-table search, each based on a different notion of table relatedness: tables produced by the same organization, tables distributed as part of the same dataset, and tables with a high degree of overlap in their annotated tags. The keyword tags provided with the metadata also permit the automatic creation of a benchmark for keyword search over tables. We provide baselines on this dataset using existing methods, including traditional and neural approaches.
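To make the three notions of relatedness concrete, the following is a minimal sketch of how pairwise labels might be derived from portal metadata. It is not the authors' implementation: the `TableMeta` record, its field names, the use of Jaccard similarity as the overlap measure, and the `tag_threshold` value of 0.5 are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class TableMeta:
    """Hypothetical metadata record for one table; field names are assumptions."""
    table_id: str
    organization: str
    dataset_id: str
    tags: frozenset = field(default_factory=frozenset)


def tag_overlap(a, b):
    """Jaccard overlap of two tag sets (an assumed overlap measure)."""
    union = a.tags | b.tags
    return len(a.tags & b.tags) / len(union) if union else 0.0


def relatedness(a, b, tag_threshold=0.5):
    """Return the three relatedness labels for a pair of tables.

    tag_threshold is an illustrative value, not one taken from the paper.
    """
    return {
        "same_organization": a.organization == b.organization,
        "same_dataset": a.dataset_id == b.dataset_id,
        "high_tag_overlap": tag_overlap(a, b) >= tag_threshold,
    }


# Two made-up tables from the same publisher but different datasets.
t1 = TableMeta("t1", "Dept. of Transport", "ds-001",
               frozenset({"traffic", "roads", "safety"}))
t2 = TableMeta("t2", "Dept. of Transport", "ds-002",
               frozenset({"traffic", "roads", "budget"}))
labels = relatedness(t1, t2)
print(labels)
```

Each label defines its own ground truth for a related-table search task, so a single query table can have three different sets of relevant results.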