High-dimensional imbalanced data poses a challenge for machine learning. When labels are scarce or of low quality, unsupervised feature selection methods are crucial for the success of subsequent algorithms, so there is a growing need for unsupervised feature selection algorithms designed for imbalanced data. We propose the Marginal Laplacian Score (MLS), a modification of the well-known Laplacian Score (LS) that is better suited to imbalanced data. We assume that minority-class or anomalous samples appear more frequently in the margins of the features. Consequently, MLS aims to preserve the local structure of the data set's margin. Because MLS is better suited to handling imbalanced data, we propose integrating it into modern feature selection methods that rely on the Laplacian Score. We integrate MLS into Differentiable Unsupervised Feature Selection (DUFS), resulting in DUFS-MLS. The proposed methods demonstrate robust and improved performance on synthetic and public data sets.
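To make the idea concrete, the following is a minimal sketch, not the authors' implementation. The function `laplacian_score` follows the standard Laplacian Score definition (lower is better) on a k-nearest-neighbour graph with heat-kernel weights. The function `marginal_laplacian_score` is only one plausible reading of the margin assumption: for each feature, it keeps the samples below the q-quantile or above the (1 - q)-quantile of that feature and scores the feature on that subset. The names `knn_affinity` and `marginal_laplacian_score`, the parameters `q`, `k`, and `t`, and the median bandwidth heuristic are all assumptions introduced here for illustration.

```python
import numpy as np


def knn_affinity(X, k=5, t=None):
    """Symmetric kNN graph with heat-kernel weights exp(-||xi - xj||^2 / t)."""
    n = X.shape[0]
    sq = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * (X @ X.T), 0.0)
    np.fill_diagonal(d2, np.inf)            # a sample is not its own neighbour
    k = max(1, min(k, n - 1))
    idx = np.argsort(d2, axis=1)[:, :k]     # indices of the k nearest neighbours
    rows = np.repeat(np.arange(n), k)
    cols = idx.ravel()
    if t is None:
        # Assumption: bandwidth = median squared kNN distance (a common heuristic).
        t = float(np.median(d2[rows, cols]))
        if not np.isfinite(t) or t <= 0.0:
            t = 1.0
    S = np.zeros((n, n))
    S[rows, cols] = np.exp(-d2[rows, cols] / t)
    return np.maximum(S, S.T)               # symmetrise the graph


def laplacian_score(X, S):
    """Standard Laplacian Score per column of X; lower means better."""
    d = S.sum(axis=1)                       # node degrees
    L = np.diag(d) - S                      # graph Laplacian
    Xc = X - (d @ X) / d.sum()              # remove the degree-weighted mean
    num = np.sum(Xc * (L @ Xc), axis=0)     # f^T L f
    den = np.sum(Xc * (d[:, None] * Xc), axis=0)  # f^T D f
    out = np.full(X.shape[1], np.inf)       # constant features get +inf
    np.divide(num, den, out=out, where=den > 1e-12)
    return out


def marginal_laplacian_score(X, q=0.1, k=5):
    """Illustrative margin variant: score each feature on its own tails only."""
    scores = np.empty(X.shape[1])
    for r in range(X.shape[1]):
        lo, hi = np.quantile(X[:, r], [q, 1.0 - q])
        mask = (X[:, r] <= lo) | (X[:, r] >= hi)   # samples in the margin
        Xm = X[mask]
        S = knn_affinity(Xm, k=k)           # graph built on margin samples only
        scores[r] = laplacian_score(Xm[:, [r]], S)[0]
    return scores


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Imbalanced toy data: 190 majority samples, 10 minority samples
    # shifted along feature 0; features 1..5 are pure noise.
    X = rng.normal(size=(200, 6))
    X[:10, 0] += 6.0
    print("LS :", np.round(laplacian_score(X, knn_affinity(X)), 4))
    print("MLS:", np.round(marginal_laplacian_score(X), 4))
```

Building the margin graph from all features of the retained samples is a design choice of this sketch; restricting the graph to the feature being scored, or using a different margin definition, would be equally reasonable under the stated assumption.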