Pretrained models have driven recent advances in Natural Language Processing (NLP), delivering strong results across a wide range of tasks. This study extends pretraining to tabular data, a domain that has traditionally been overlooked and is inherently challenging because table schemas vary widely from task to task. The work addresses five research questions: how to adapt to heterogeneous table structures, how to establish a universal pretraining protocol for tabular data, how well learned knowledge generalizes and transfers across tasks, how to adapt to diverse downstream applications, and how to incorporate columns that are added incrementally over time. In response, we introduce UniTabE, a method that processes tables in a uniform manner, free of the constraints imposed by any specific table structure. Its core idea is to represent each basic table element with a module termed TabUnit, whose output is then refined by a Transformer encoder. The model also supports pretraining and finetuning through free-form prompts. For pretraining, we curated a large tabular dataset of approximately 13 billion samples collected from the Kaggle platform. We evaluated the method experimentally under a wide range of scenarios. The results show that UniTabE outperforms several baseline models across multiple benchmark datasets, underscoring its potential to significantly enhance the semantic representation of tabular data and marking a step forward for tabular data analysis.
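To make the idea of schema-independent processing concrete, the following is a minimal sketch of how rows from tables with different columns could be flattened into one uniform sequence of per-cell units before any encoder sees them. It is not the UniTabE implementation: the names `Unit`, `infer_type`, `row_to_units`, and `build_input`, the coarse type tags, and the way the prompt is attached are all assumptions made for illustration, and the learned TabUnit module and Transformer encoder are deliberately left out.

```python
from typing import Any, Dict, List, Optional, Tuple

# One "unit" per table cell: (column name, coarse type tag, value as text).
# The type tag and this exact layout are illustrative assumptions.
Unit = Tuple[str, str, str]


def infer_type(value: Any) -> str:
    """Return a coarse type tag for a cell value (hypothetical helper)."""
    # bool is checked first because bool is a subclass of int in Python.
    if isinstance(value, bool):
        return "categorical"
    if isinstance(value, (int, float)):
        return "numeric"
    return "text"


def row_to_units(row: Dict[str, Any]) -> List[Unit]:
    """Flatten one row of any schema into a list of cell units."""
    units: List[Unit] = []
    for column, value in row.items():
        if value is None:  # skip missing cells instead of inventing a value
            continue
        units.append((column, infer_type(value), str(value)))
    return units


def build_input(row: Dict[str, Any], prompt: Optional[str] = None) -> Dict[str, Any]:
    """Pair the cell units with a free-form prompt (assumed input layout)."""
    return {"units": row_to_units(row), "prompt": prompt or ""}


if __name__ == "__main__":
    # Two rows with unrelated schemas pass through the same code path.
    customer = {"age": 42, "city": "Oslo"}
    product = {"price": 9.5, "in_stock": True, "sku": "A-17", "note": None}
    print(build_input(customer, prompt="predict churn"))
    print(build_input(product))

    # A column added later simply yields one more unit; nothing is redefined.
    customer_later = {**customer, "income": 50000}
    print(row_to_units(customer_later))
```

Because every cell carries its own column name, no fixed column order or fixed column count is baked into the representation, which is why a column added later extends the sequence without changing how existing cells are handled.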