Self-supervised learning on imbalanced data, especially tabular data, has not been extensively studied. Existing research has predominantly focused on image datasets. This paper aims to fill this gap by examining the specific challenges that data imbalance poses for self-supervised learning on tabular data, with a primary focus on autoencoders. Autoencoders are widely used to learn a new representation of a dataset, particularly for dimensionality reduction. They are also often used to learn generative models, as in variational autoencoders. With mixed tabular data, categorical (qualitative) variables are typically one-hot encoded and the model is trained with a standard loss function (MSE or cross-entropy). In this paper, we analyze the drawbacks of this approach, especially when the categorical variables are imbalanced. We propose a novel metric to balance learning: a Multi-Supervised Balanced MSE. This approach reduces the reconstruction error by balancing the influence of the variables. Finally, we empirically demonstrate that, compared to the standard MSE, this new metric i) performs better when the dataset is imbalanced, especially when the learning process is insufficient, and ii) provides similar results otherwise.
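To illustrate the drawback of a standard MSE on a one-hot encoded, imbalanced categorical variable, the following is a minimal sketch. The abstract does not define the Multi-Supervised Balanced MSE, so `class_balanced_mse` below is a hypothetical stand-in that simply averages the per-category errors; it should not be read as the metric proposed in the paper. The helper names, the 95/5 split, and the degenerate reconstruction are likewise assumptions chosen for illustration.

```python
import numpy as np


def one_hot(labels, n_classes):
    """Return an (n_samples, n_classes) one-hot matrix."""
    out = np.zeros((labels.size, n_classes))
    out[np.arange(labels.size), labels] = 1.0
    return out


def mse(target, recon):
    """Standard MSE, averaged over all samples and all one-hot columns."""
    return float(np.mean((target - recon) ** 2))


def class_balanced_mse(target, recon, labels, n_classes):
    """Hypothetical balanced variant (an assumption, not the paper's metric):
    compute the MSE separately for each category that is present, then
    average those values so that every category has the same weight."""
    per_sample = np.mean((target - recon) ** 2, axis=1)
    present = [c for c in range(n_classes) if np.any(labels == c)]
    return float(np.mean([per_sample[labels == c].mean() for c in present]))


# Imbalanced binary variable: 95 samples of category 0, 5 of category 1.
labels = np.array([0] * 95 + [1] * 5)
target = one_hot(labels, 2)

# Degenerate "reconstruction" that always outputs the majority category.
recon = np.tile(np.array([1.0, 0.0]), (labels.size, 1))

plain = mse(target, recon)
balanced = class_balanced_mse(target, recon, labels, 2)

print(f"standard MSE: {plain:.3f}")
print(f"balanced MSE: {balanced:.3f}")
```

In this toy setting, a reconstruction that ignores the minority category entirely gets a standard MSE of 0.05, while the per-category average is 0.5. This is the kind of imbalance effect the abstract refers to: under the standard loss, the majority category dominates the reconstruction error.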