Machine learning (ML) models trained with Empirical Risk Minimization (ERM) often exhibit systematic errors on specific subpopulations of tabular data, known as error slices. Learning robust representations in the presence of error slices is challenging, especially in self-supervised settings during the feature reconstruction phase, due to high-cardinality features and the complexity of constructing error sets. Traditional robust representation learning methods largely focus on improving worst-group performance in supervised computer vision settings, leaving a gap in approaches tailored to tabular data. We address this gap by developing a framework that learns robust representations for tabular data during self-supervised pre-training. Our approach uses an encoder-decoder model trained with a Masked Language Modeling (MLM) loss to learn robust latent representations. This paper applies the Just Train Twice (JTT) and Deep Feature Reweighting (DFR) methods during the pre-training phase for tabular data. These methods fine-tune the ERM pre-trained model by up-weighting error-prone samples or by creating balanced datasets for specific categorical features. This yields a specialized model for each feature, and these models are then combined in an ensemble to enhance downstream classification performance. The methodology improves robustness across slices, thereby enhancing overall generalization performance. Extensive experiments across various datasets demonstrate the efficacy of our approach. The code is available at \url{https://github.com/amazon-science/distributionally-robust-self-supervised-learning-for-tabular-data}.
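As a minimal sketch of the JTT-style reweighting step described above: after the first ERM pre-training round, samples with unusually high reconstruction (MLM) loss form the error set and are up-weighted for a second round. The quantile threshold and up-weight factor below are illustrative hyperparameters, not values from the paper.

```python
import numpy as np

def jtt_weights(recon_losses, quantile=0.8, upweight=5.0):
    """Just-Train-Twice-style sample reweighting for self-supervised
    pre-training (hedged sketch; hyperparameters are illustrative).

    Samples whose first-round reconstruction loss exceeds the given
    quantile are treated as the error set and up-weighted for the
    second training round; all other samples keep weight 1.0.
    """
    losses = np.asarray(recon_losses, dtype=float)
    threshold = np.quantile(losses, quantile)
    return np.where(losses > threshold, upweight, 1.0)

# Example: ten samples, two with unusually high MLM reconstruction loss.
losses = [0.2, 0.3, 0.1, 0.25, 2.1, 0.15, 0.22, 1.9, 0.18, 0.27]
weights = jtt_weights(losses)
```

In a full pipeline these weights would scale each sample's MLM loss during the second pre-training round, so the encoder is pushed to reconstruct the error-prone rows well.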