We propose a new method for high-dimensional semi-supervised learning problems based on the careful aggregation of the results of a low-dimensional procedure applied to many axis-aligned random projections of the data. Our primary goal is to identify important variables for distinguishing between the classes; existing low-dimensional methods can then be applied for final class assignment. Motivated by a generalized Rayleigh quotient, we score projections according to the traces of the estimated whitened between-class covariance matrices on the projected data. This enables us to assign an importance weight to each variable for a given projection, and to select our signal variables by aggregating these weights over high-scoring projections. Our theory shows that the resulting Sharp-SSL algorithm is able to recover the signal coordinates with high probability when we aggregate over sufficiently many random projections and when the base procedure estimates the whitened between-class covariance matrix sufficiently well. The Gaussian EM algorithm is a natural choice as a base procedure, and we provide a new analysis of its performance in semi-supervised settings that controls the parameter estimation error in terms of the proportion of labeled data in the sample. Numerical results on both simulated data and a real colon tumor dataset support the excellent empirical performance of the method.
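The scoring and aggregation steps described above can be illustrated with a minimal sketch. It is not the authors' Sharp-SSL implementation, and it rests on several assumptions. It replaces the semi-supervised Gaussian EM base procedure with plain labeled-data estimates of the class means and pooled covariance. It takes the per-variable weights to be the diagonal entries of the estimated whitened between-class covariance matrix. It aggregates over a fixed top fraction of projections. The function names and the parameters `d`, `n_proj`, `top_frac`, and `n_select` are hypothetical.

```python
import numpy as np


def projection_score_and_weights(X, y, coords):
    """Score one axis-aligned projection (labeled data only, an assumption)."""
    Xs = X[:, coords]
    n, d = Xs.shape
    overall_mean = Xs.mean(axis=0)
    Sw = np.zeros((d, d))  # pooled within-class covariance
    Sb = np.zeros((d, d))  # between-class covariance
    for k in np.unique(y):
        Xk = Xs[y == k]
        mk = Xk.mean(axis=0)
        centered = Xk - mk
        Sw += centered.T @ centered
        diff = (mk - overall_mean)[:, None]
        Sb += len(Xk) * (diff @ diff.T)
    Sw = Sw / n + 1e-6 * np.eye(d)  # small ridge keeps Sw invertible
    Sb = Sb / n
    # Whitened between-class covariance: Sw^{-1/2} Sb Sw^{-1/2}
    evals, evecs = np.linalg.eigh(Sw)
    W = evecs @ np.diag(evals ** -0.5) @ evecs.T
    M = W @ Sb @ W
    score = np.trace(M)    # projection score: trace of the whitened matrix
    weights = np.diag(M)   # assumed per-variable importance weights
    return score, weights


def select_variables(X, y, d=3, n_proj=300, top_frac=0.2, n_select=2, seed=0):
    """Aggregate weights over high-scoring random axis-aligned projections."""
    rng = np.random.default_rng(seed)
    p = X.shape[1]
    results = []
    for _ in range(n_proj):
        # An axis-aligned projection is a random subset of d coordinates.
        coords = rng.choice(p, size=d, replace=False)
        score, w = projection_score_and_weights(X, y, coords)
        results.append((score, coords, w))
    # Keep only the highest-scoring projections.
    results.sort(key=lambda r: r[0], reverse=True)
    keep = results[: max(1, int(top_frac * n_proj))]
    # Accumulate each kept projection's weights onto the original coordinates.
    importance = np.zeros(p)
    for _, coords, w in keep:
        importance[coords] += w
    # Return the indices of the n_select most important variables, sorted.
    return np.sort(np.argsort(importance)[-n_select:])
```

Calling `select_variables(X, y)` on an `n × p` data matrix `X` with integer class labels `y` returns the indices of the selected variables. In a semi-supervised setting, the labeled-data estimates inside `projection_score_and_weights` would be replaced by the output of a base procedure that also uses the unlabeled observations, such as the EM algorithm mentioned above.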