Tabular data pervades the World Wide Web and plays a foundational role in the digital architecture that underpins online information. Given the recent influence of large-scale pretrained models such as ChatGPT and SAM across many domains, applying pretraining techniques to mine tabular data on the web has emerged as a highly promising research direction. Several recent works address this topic, but most (if not all) of them are limited to a fixed-schema or single-table setting. Given the limited dataset scale and parameter size of prior models, we believe the field has not yet reached the "BERT moment" for ubiquitous tabular data, and progress along this line lags significantly behind counterpart research domains such as natural language processing. In this work, we first identify the crucial challenges behind tabular data pretraining, particularly overcoming the cross-table hurdle. As a pioneering endeavor, this work (i) contributes a high-quality real-world tabular dataset; (ii) proposes an innovative, generic, and efficient cross-table pretraining framework, dubbed CM2, whose core is a semantic-aware tabular neural network that uniformly encodes heterogeneous tables without much restriction; and (iii) introduces a novel pretraining objective, prompt Masked Table Modeling (pMTM), inspired by NLP but carefully tailored to scalable pretraining on tables. Our extensive experiments demonstrate CM2's state-of-the-art performance and validate that cross-table pretraining can enhance various downstream tasks.