Observatory: Characterizing Embeddings of Relational Tables

Language models and specialized table embedding models have recently demonstrated strong performance on many tasks over tabular data. Researchers and practitioners are keen to leverage these models in many new application contexts; but limited understanding of the strengths and weaknesses of these models, and the table representations they generate, makes the process of finding a suitable model for a given task reliant on trial and error. There is an urgent need to gain a comprehensive understanding of these models to minimize inefficiency and failures in downstream usage. To address this need, we propose Observatory, a formal framework to systematically analyze embedding representations of relational tables. Motivated both by invariants of the relational data model and by statistical considerations regarding data distributions, we define eight primitive properties, and corresponding measures to quantitatively characterize table embeddings for these properties. Based on these properties, we define an extensible framework to evaluate language and table embedding models. We collect and synthesize a suite of datasets and use Observatory to analyze nine such models. Our analysis provides insights into the strengths and weaknesses of learned representations over tables. We find, for example, that some models are sensitive to table structure such as column order, that functional dependencies are rarely reflected in embeddings, and that specialized table embedding models have relatively lower sample fidelity. Such insights help researchers and practitioners better anticipate model behaviors and select appropriate models for their downstream tasks, while guiding researchers in the development of new models.
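To make the column-order finding concrete, here is a minimal sketch of how sensitivity to column order could be probed. It is not Observatory's actual measure or API. The two toy embedders, the helper names, and the use of minimum cosine similarity across column permutations are all assumptions made for illustration.

```python
import itertools
import math

# Hypothetical toy setup: a table is a list of (header, values) columns.
# Neither embedder below is a real model; they only illustrate the probe.

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def column_vector(header, values, dim=8):
    """Toy column embedding: character counts bucketed by code point."""
    vec = [0.0] * dim
    for token in [header, *values]:
        for ch in str(token):
            vec[ord(ch) % dim] += 1.0
    return vec

def embed_order_invariant(table):
    """Sum of column vectors, so column order cannot matter."""
    cols = [column_vector(h, v) for h, v in table]
    return [sum(xs) for xs in zip(*cols)]

def embed_order_sensitive(table):
    """Position-weighted sum, so column order changes the result."""
    cols = [column_vector(h, v) for h, v in table]
    dim = len(cols[0])
    return [sum((i + 1) * c[d] for i, c in enumerate(cols)) for d in range(dim)]

def min_similarity_under_permutation(embed, table):
    """Lowest cosine similarity between the original table embedding and
    the embedding of any column permutation (1.0 means order-insensitive)."""
    base = embed(table)
    return min(
        cosine(base, embed(list(perm)))
        for perm in itertools.permutations(table)
    )

table = [
    ("id", ["1", "2", "3"]),
    ("name", ["ann", "bob", "cy"]),
    ("city", ["Oslo", "Rome", "Lima"]),
]

invariant_score = min_similarity_under_permutation(embed_order_invariant, table)
sensitive_score = min_similarity_under_permutation(embed_order_sensitive, table)
print(f"order-invariant embedder: {invariant_score:.4f}")
print(f"order-sensitive embedder: {sensitive_score:.4f}")
```

Enumerating every permutation is only practical for narrow tables. For wider ones, sampling a fixed number of random permutations would be the natural substitute.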
