Self-supervision is often used for pre-training to improve performance on a downstream task by constructing meaningful representations of samples. Self-supervised learning (SSL) generally involves generating different views of the same sample, and thus requires data augmentations that are challenging to construct for tabular data. This constitutes one of the main challenges of self-supervision for structured data. In the present work, we propose a novel augmentation-free SSL method for tabular data. Our approach, T-JEPA, relies on a Joint Embedding Predictive Architecture (JEPA) and is akin to mask reconstruction in the latent space. It involves predicting the latent representation of one subset of features from the latent representation of a different subset within the same sample, thereby learning rich representations without augmentations. We use our method as a pre-training technique and train several deep classifiers on the obtained representations. Our experimental results demonstrate a substantial improvement in both classification and regression tasks, outperforming models trained directly on samples in their original data space. Moreover, T-JEPA enables some methods to consistently outperform or match the performance of traditional methods like Gradient Boosted Decision Trees. To understand why, we extensively characterize the obtained representations and show that T-JEPA effectively identifies relevant features for downstream tasks without access to the labels. Additionally, we introduce regularization tokens, a novel regularization method critical for training JEPA-based models on structured data.
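The core objective described above can be sketched numerically. The following is a minimal illustration only, not the paper's implementation: the encoders are stand-in linear maps (`W_ctx`, `W_tgt`, `W_pred` are hypothetical names), and the feature-subset choices are arbitrary. It shows the key idea: one feature subset's latent predicts another subset's latent within the same sample, so no augmented views are needed.

```python
import numpy as np

rng = np.random.default_rng(0)
D_IN, D_LAT = 8, 4  # number of tabular features, latent dimension (illustrative)

# Hypothetical linear stand-ins for the three networks in a JEPA:
W_ctx = rng.normal(size=(D_IN, D_LAT)) * 0.1  # context encoder
W_tgt = W_ctx.copy()                          # target encoder (EMA copy of the context encoder)
W_pred = np.eye(D_LAT)                        # predictor in latent space


def mask(x, idx):
    """Zero out every feature not in the given subset."""
    m = np.zeros_like(x)
    m[idx] = x[idx]
    return m


def tjepa_loss(x, ctx_idx, tgt_idx):
    """Predict the latent of the target feature subset from the latent of
    the context subset of the SAME sample -- latent-space mask reconstruction."""
    z_ctx = mask(x, ctx_idx) @ W_ctx  # latent of context features
    z_tgt = mask(x, tgt_idx) @ W_tgt  # latent of target features (stop-gradient in practice)
    z_hat = z_ctx @ W_pred            # predicted target latent
    return np.mean((z_hat - z_tgt) ** 2)


x = rng.normal(size=D_IN)  # one tabular sample
loss = tjepa_loss(x, ctx_idx=[0, 1, 2, 3], tgt_idx=[4, 5, 6, 7])
print(loss >= 0.0)  # a valid squared-error loss is non-negative

# The target encoder typically tracks the context encoder via an
# exponential moving average rather than direct gradient updates:
EMA = 0.99
W_tgt = EMA * W_tgt + (1 - EMA) * W_ctx
```

In practice the encoders and predictor would be deep networks trained by gradient descent on this loss, with the target encoder held out of backpropagation; the linear version here only makes the prediction target concrete.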