Pretrained deep-learning models are the go-to solution for images and text. For tabular data, however, the standard is still to train tree-based models. Indeed, transfer learning on tables runs into the challenge of data integration: finding correspondences between entries (entity matching), where different strings may denote the same entity, and across columns (schema matching), which may come in different orders, under different names, etc. We propose a neural architecture that does not need such correspondences. As a result, we can pretrain it on background data that has not been matched. The architecture -- CARTE, for Context Aware Representation of Table Entries -- uses a graph representation of tabular (or relational) data to process tables with different columns, string embeddings of entries and column names to model an open vocabulary, and a graph-attention network to contextualize entries with column names and neighboring entries. An extensive benchmark shows that CARTE facilitates learning, outperforming a solid set of baselines including the best tree-based models. CARTE also enables joint learning across tables with unmatched columns, enhancing a small table with bigger ones. CARTE thus opens the door to large pretrained models for tabular data.
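The abstract's core idea -- representing each row as a graph whose nodes are string-embedded cell values and whose edges carry string-embedded column names, then contextualizing entries with attention -- can be sketched in a few lines. This is a minimal illustration, not the paper's actual model: the hash-based `embed_string`, the star-graph layout, and the single untrained attention step are all simplifying assumptions standing in for learned components.

```python
import hashlib
import numpy as np

DIM = 16  # embedding dimension (illustrative)

def embed_string(s, dim=DIM):
    """Deterministic pseudo-random unit vector keyed by a string.

    A stand-in for the learned open-vocabulary string embeddings of
    cell values and column names.
    """
    seed = int.from_bytes(hashlib.sha256(s.encode()).digest()[:8], "big")
    v = np.random.default_rng(seed).standard_normal(dim)
    return v / np.linalg.norm(v)

def row_to_star_graph(row):
    """Turn one table row (a dict of column -> value) into a star graph:
    a center node plus one node per cell, each edge carrying the
    embedding of its column name."""
    nodes = np.stack([embed_string(str(v)) for v in row.values()])
    edges = np.stack([embed_string(c) for c in row.keys()])
    center = nodes.mean(axis=0)  # simple initialization of the center node
    return center, nodes, edges

def attend(center, nodes, edges):
    """One simplified graph-attention step: the center node attends to the
    cell nodes, with each key modulated elementwise by its column-name
    (edge) embedding, so the same value appearing under different columns
    is scored differently."""
    keys = nodes * edges                       # column-aware keys
    scores = keys @ center / np.sqrt(DIM)      # scaled dot-product
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                   # softmax over cell nodes
    return weights @ keys, weights             # contextualized representation

# Two rows with different schemas go through the same code path:
# no entity matching or schema matching is needed beforehand.
row_a = {"Name": "Domaine du Colombier", "Region": "Burgundy", "Rating": "4.5"}
row_b = {"wine": "Domaine du Colombier", "country": "France"}
ctx_a, w_a = attend(*row_to_star_graph(row_a))
ctx_b, w_b = attend(*row_to_star_graph(row_b))
```

Because column names enter only as edge features, a table with three columns and a table with two unrelated columns both map to fixed-size vectors, which is what lets a single pretrained model consume unmatched background tables.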