Contrastive learning is a model pre-training technique that first creates similar views of the original data and then encourages the data and its corresponding views to be close in the embedding space. Contrastive learning has been successful on image and natural language data, thanks to domain-specific augmentation techniques that are both intuitive and effective. In the tabular domain, however, the predominant augmentation technique for creating views is to corrupt tabular entries by swapping values, which is neither as sound nor as effective. We propose a simple yet powerful improvement to this augmentation technique: corrupting tabular data conditioned on class identity. Specifically, when corrupting a tabular entry in an anchor row, instead of sampling a replacement value uniformly at random from the same feature column across the entire table, we sample only from rows identified as belonging to the same class as the anchor row. We assume the semi-supervised learning setting and adopt pseudo labeling to obtain class identities for all table rows. We also explore the novel idea of selecting the features to corrupt based on feature correlation structures. Extensive experiments show that the proposed approach consistently outperforms the conventional corruption method on tabular data classification tasks. Our code is available at https://github.com/willtop/Tabular-Class-Conditioned-SSL.
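The class-conditioned corruption described in the abstract can be illustrated with a minimal sketch. It is not the implementation from the linked repository. The function name `class_conditioned_corrupt` and the parameters `corruption_rate` and `seed` are hypothetical. The sketch assumes that class identities (for example, pseudo labels) are already available for every row, and that each entry is selected for corruption independently with a fixed probability; the correlation-based feature selection mentioned in the abstract is not modeled here.

```python
import random
from collections import defaultdict


def class_conditioned_corrupt(rows, labels, corruption_rate=0.3, seed=None):
    """Build a corrupted view of a table (hypothetical helper, not the authors' code).

    rows:   list of equal-length lists, one per table row
    labels: one class identity per row (assumed to come from pseudo labeling)

    Each entry selected for corruption is replaced by the value in the same
    feature column of a row drawn uniformly from the anchor row's own class,
    rather than from the entire table.
    """
    rng = random.Random(seed)

    # Group row indices by class so donor rows can be drawn per class.
    rows_by_class = defaultdict(list)
    for index, label in enumerate(labels):
        rows_by_class[label].append(index)

    view = []
    for index, row in enumerate(rows):
        donors = rows_by_class[labels[index]]
        corrupted = list(row)
        for column in range(len(row)):
            # Assumption: each entry is corrupted independently with
            # probability `corruption_rate`.
            if rng.random() < corruption_rate:
                donor = rng.choice(donors)  # may pick the anchor row itself
                corrupted[column] = rows[donor][column]
        view.append(corrupted)
    return view
```

Calling `class_conditioned_corrupt(rows, labels, corruption_rate=0.5, seed=0)` returns a new table of the same shape in which every value still comes from its original column and from a row of the same class. The conventional corruption method corresponds to giving every row the same label, so that donors are drawn from the whole table.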