Tabular data analysis is performed every day across many domains. Operating on table fields correctly and finding common patterns in daily analysis requires an accurate understanding of field semantics. In this paper, we introduce the AnaMeta dataset, a collection of 467k tables with derived supervision labels for four types of commonly used field metadata: measure/dimension dichotomy, common field roles, semantic field type, and default aggregation function. As a benchmark, we evaluate a wide range of models on inferring this metadata. We also propose a multi-encoder framework, KDF, which improves the metadata understanding capability of tabular models by incorporating distribution and knowledge information. Furthermore, we propose four interfaces for incorporating field metadata into downstream analysis tasks.
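To make the four metadata types concrete, the following minimal sketch shows one way they might be attached to table fields and used to drive a default summary. It is illustrative only: the `FieldMetadata` class, the `summarize` helper, the label strings (such as `"money"` or `"avg"`), and the sample rows are all hypothetical and do not reflect the AnaMeta schema, its label vocabulary, or the KDF framework.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Dict, List, Optional


@dataclass(frozen=True)
class FieldMetadata:
    """Hypothetical container for the four metadata types named in the abstract."""
    name: str
    is_measure: bool             # measure/dimension dichotomy
    role: str                    # common field role (label strings are assumed)
    semantic_type: str           # semantic field type (label strings are assumed)
    default_agg: Optional[str]   # default aggregation function; None for dimensions


# Assumed mapping from aggregation labels to Python callables.
AGGREGATORS: Dict[str, Callable[[List[float]], float]] = {
    "sum": sum,
    "avg": mean,
}


def summarize(rows: List[dict], fields: List[FieldMetadata]) -> Dict[str, float]:
    """Aggregate every measure field with its default aggregation function."""
    result: Dict[str, float] = {}
    for field in fields:
        if field.is_measure and field.default_agg in AGGREGATORS:
            values = [row[field.name] for row in rows]
            result[field.name] = AGGREGATORS[field.default_agg](values)
    return result


# Toy table: one dimension and two measures with different default aggregations.
rows = [
    {"region": "North", "revenue": 100.0, "unit_price": 4.0},
    {"region": "South", "revenue": 250.0, "unit_price": 6.0},
]
fields = [
    FieldMetadata("region", False, "dimension", "location", None),
    FieldMetadata("revenue", True, "measure", "money", "sum"),
    FieldMetadata("unit_price", True, "measure", "money", "avg"),
]

summary = summarize(rows, fields)
print(summary)
```

The sketch illustrates why the default aggregation function matters: both `revenue` and `unit_price` are numeric measures, yet summing unit prices would be meaningless, so the two fields call for different aggregations.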