Tabular data is the most widely used data format in machine learning (ML). Although tree-based methods outperform deep learning (DL) methods in supervised learning, recent literature reports that self-supervised learning with Transformer-based models outperforms tree-based methods. In the existing literature on self-supervised learning for tabular data, contrastive learning is the predominant method, and it relies on data augmentation to generate different views of the same sample. However, data augmentation for tabular data has been difficult because of the unique structure and high complexity of such data. In addition, existing methods typically propose three main components together: the model structure, the self-supervised learning method, and the data augmentation. As a result, previous works have compared performance without comprehensively accounting for these components, and it remains unclear how each component affects the actual performance. In this study, we focus on data augmentation to address these issues. We propose a novel data augmentation method, $\textbf{M}$ask $\textbf{T}$oken $\textbf{R}$eplacement ($\texttt{MTR}$), which replaces a portion of the tokenized column embeddings with a mask token; $\texttt{MTR}$ takes advantage of the properties of the Transformer, which is becoming the predominant DL-based architecture for tabular data, to perform data augmentation on each column embedding. Through experiments on 13 diverse public datasets in both supervised and self-supervised learning scenarios, we show that $\texttt{MTR}$ achieves competitive performance against existing data augmentation methods and improves model performance. In addition, we discuss the specific scenarios in which $\texttt{MTR}$ is most effective and identify the scope of its application. The code is available at https://github.com/somaonishi/MTR/.
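The idea behind $\texttt{MTR}$ can be illustrated with a minimal sketch. This is not the implementation from the repository above; it only assumes that each row has already been tokenized into one embedding per column, and that each column embedding is independently swapped for a shared mask token with some probability. The function name `mask_token_replacement`, the replacement probability `p`, and the use of NumPy arrays are all illustrative assumptions. In a real Transformer the mask token would normally be a learnable parameter rather than a fixed vector.

```python
import numpy as np


def mask_token_replacement(tokens, mask_token, p=0.2, rng=None):
    """Illustrative sketch: swap column embeddings for a mask token.

    tokens:     array of shape (batch, n_columns, dim), one embedding per column
    mask_token: array of shape (dim,), shared across all columns (assumption)
    p:          probability of replacing each column embedding (assumption)
    """
    rng = np.random.default_rng() if rng is None else rng
    batch, n_columns, _ = tokens.shape

    # Draw one independent replace/keep decision per (row, column) pair.
    replaced = rng.random((batch, n_columns)) < p

    # Broadcast the decision over the embedding dimension and substitute
    # the mask token wherever a column was selected.
    augmented = np.where(replaced[..., None], mask_token, tokens)
    return augmented, replaced


# Toy usage: 4 rows, 5 columns, 8-dimensional column embeddings.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(4, 5, 8))
mask_token = np.zeros(8)  # stand-in for a learnable mask embedding
augmented, replaced = mask_token_replacement(tokens, mask_token, p=0.3, rng=rng)
```

Because the substitution acts on column embeddings rather than raw values, the same operation applies to numerical and categorical columns alike, which is what makes it a natural fit for Transformer-based tabular models.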