Given the ubiquitous use of tabular data in industry and growing concerns about data privacy and security, tabular data synthesis has emerged as a critical research area. Recent state-of-the-art methods show that large language models (LLMs) can be adopted to generate realistic tabular data. Because LLMs pre-process tabular data as full text, they avoid the curse of dimensionality associated with one-hot encoding high-dimensional data. However, their long training time and limited reusability on new tasks prevent them from replacing existing tabular generative models. In this paper, we propose Tabula, a tabular data synthesizer based on the language model structure. Through Tabula, we demonstrate the inherent limitation of employing pre-trained language models designed for natural language processing (NLP) in the context of tabular data synthesis, and we investigate the development of a dedicated foundation model tailored specifically to tabular data synthesis. Additionally, we propose a token sequence compression strategy that significantly reduces training time while preserving the quality of the synthetic data. Extensive experiments on six datasets demonstrate that using a language model structure without loading the pre-trained model weights yields a better starting model for tabular data synthesis. Moreover, a Tabula model previously trained on other tabular data serves as an excellent foundation model for new tabular data synthesis tasks. Results show that, compared with the current LLM-based state-of-the-art algorithm, Tabula reduces training time per epoch by 46.2% on average while consistently achieving higher synthetic data utility.