Self-supervision is often used for pre-training to improve performance on a downstream task by constructing meaningful representations of samples. Self-supervised learning (SSL) generally involves generating different views of the same sample and thus requires data augmentations, which are challenging to construct for tabular data. This constitutes one of the main challenges of self-supervision for structured data. In the present work, we propose a novel augmentation-free SSL method for tabular data. Our approach, T-JEPA, relies on a Joint Embedding Predictive Architecture (JEPA) and is akin to mask reconstruction in the latent space. It involves predicting the latent representation of one subset of features from the latent representation of a different subset within the same sample, thereby learning rich representations without augmentations. We use our method as a pre-training technique and train several deep classifiers on the obtained representations. Our experimental results demonstrate substantial improvements in both classification and regression tasks, outperforming models trained directly on samples in their original data space. Moreover, T-JEPA enables some methods to consistently outperform or match the performance of traditional methods such as Gradient Boosted Decision Trees. To understand why, we extensively characterize the obtained representations and show that T-JEPA effectively identifies relevant features for downstream tasks without access to the labels. Additionally, we introduce regularization tokens, a novel regularization method critical for the training of JEPA-based models on structured data.
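The core idea of predicting one feature subset's latent representation from another's can be illustrated with a minimal sketch. This is a hypothetical toy with linear maps standing in for the learned encoders and predictor; the actual T-JEPA architecture, masking strategy, and teacher-update scheme are not specified here.

```python
# Hypothetical sketch of latent masked prediction in a JEPA-like setup for
# one tabular sample. Linear maps stand in for learned networks; in a real
# JEPA the target encoder is an EMA copy of the context encoder and is not
# updated by gradients.
import numpy as np

rng = np.random.default_rng(0)

n_features, d_latent = 8, 4
x = rng.normal(size=(n_features,))        # one tabular sample with 8 features

# Split the features into a visible (context) subset and a masked (target) subset.
context_idx = np.array([0, 1, 2, 3, 4])
target_idx = np.array([5, 6, 7])

# Stand-ins for the context encoder, target encoder, and predictor head.
W_context = rng.normal(size=(d_latent, len(context_idx)))
W_target = rng.normal(size=(d_latent, len(target_idx)))
W_pred = rng.normal(size=(d_latent, d_latent))

z_context = W_context @ x[context_idx]    # latent of the visible subset
z_target = W_target @ x[target_idx]       # latent of the masked subset
z_hat = W_pred @ z_context                # prediction of the target latent

# The objective is regression in latent space, not reconstruction of raw x.
loss = np.mean((z_hat - z_target) ** 2)
```

Because the loss is computed between latent vectors rather than raw feature values, no handcrafted augmentations of the input are required, which is the property the abstract highlights for tabular data.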