Tabular data is the most widely used data format in machine learning (ML). While tree-based methods outperform deep learning (DL)-based methods in supervised learning, recent literature reports that self-supervised learning with Transformer-based models outperforms tree-based methods. In the existing literature on self-supervised learning for tabular data, contrastive learning is the predominant method. Contrastive learning relies on data augmentation to generate different views of each sample. However, data augmentation for tabular data has been difficult because of the unique structure and high complexity of such data. In addition, existing methods propose three main components together: model structure, self-supervised learning method, and data augmentation. As a result, previous works have compared performance without considering these components separately, and it is not clear how each component affects the actual performance. In this study, we focus on data augmentation to address these issues. We propose a novel data augmentation method, $\textbf{M}$ask $\textbf{T}$oken $\textbf{R}$eplacement ($\texttt{MTR}$), which replaces a portion of each sample's tokenized columns with the mask token; $\texttt{MTR}$ takes advantage of the properties of the Transformer, which is becoming the predominant DL-based architecture for tabular data, to perform data augmentation on each column embedding. Through experiments with 13 diverse public datasets in both supervised and self-supervised learning scenarios, we show that $\texttt{MTR}$ achieves competitive performance against existing data augmentation methods and improves model performance. In addition, we discuss specific scenarios in which $\texttt{MTR}$ is most effective and identify the scope of its application. The code is available at https://github.com/somaonishi/MTR/.
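
To make the idea of augmenting at the level of column embeddings concrete, the following is a minimal sketch and not the authors' implementation (that is available at the repository linked above). It assumes each row has already been tokenized into one embedding vector per column, and that every column embedding is independently swapped for a shared mask token with some probability. The function name `mask_token_replacement`, the parameter `mask_prob`, and the independent per-column sampling are all assumptions made for illustration.

```python
import random
from typing import List, Optional

Vector = List[float]


def mask_token_replacement(
    tokens: List[List[Vector]],
    mask_token: Vector,
    mask_prob: float,
    rng: Optional[random.Random] = None,
) -> List[List[Vector]]:
    """Return an augmented copy of `tokens`.

    `tokens` has shape (rows, columns, embedding_dim). Each column
    embedding is replaced by `mask_token` with probability `mask_prob`
    (a hypothetical hyperparameter, not taken from the paper).
    """
    if not 0.0 <= mask_prob <= 1.0:
        raise ValueError("mask_prob must lie in [0, 1]")
    rng = rng or random.Random()

    augmented = []
    for row in tokens:
        new_row = []
        for column_embedding in row:
            # Decide independently for every column of every row.
            if rng.random() < mask_prob:
                new_row.append(list(mask_token))
            else:
                new_row.append(list(column_embedding))
        augmented.append(new_row)
    return augmented


# Two rows, three columns, embedding dimension 2.
batch = [
    [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],
    [[0.7, 0.8], [0.9, 1.0], [1.1, 1.2]],
]
mask = [0.0, 0.0]  # stands in for a learnable mask embedding

view = mask_token_replacement(batch, mask, mask_prob=0.3, rng=random.Random(0))
```

In a Transformer-based model the mask token would normally be a learnable parameter with the same dimension as the column embeddings; a fixed zero vector stands in for it here. Calling the function twice on the same batch with different random states would yield two views of each row, as contrastive learning requires.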