We study the classification problem for high-dimensional data with $n$ observations on $p$ features, where the $p \times p$ covariance matrix $\Sigma$ exhibits a spiked eigenvalue structure and the vector $\zeta$, defined as the difference between the {\em whitened} mean vectors, is sparse. We analyze a classifier that is adaptive with respect to the sparsity $s$ and performs dimension reduction on the feature vectors prior to classification in the reduced space: it first whitens the data, then screens the features by keeping only those corresponding to the $s$ largest (in magnitude) coordinates of $\zeta$, and finally applies Fisher's linear discriminant to the selected features. Leveraging recent results on entrywise matrix perturbation bounds for covariance matrices, we show that the resulting classifier is Bayes optimal whenever $n \rightarrow \infty$ and $s \sqrt{n^{-1} \ln p} \rightarrow 0$. Notably, our theory also guarantees Bayes optimality for the corresponding quadratic discriminant analysis (QDA). Experimental results on real and synthetic data further indicate that the proposed approach is competitive with state-of-the-art methods while operating on a substantially lower-dimensional representation.
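The whiten-screen-classify pipeline described above can be sketched in a few lines of NumPy. This is a minimal illustrative sketch, not the paper's implementation: the function name `screened_whitened_lda`, the eigendecomposition-based whitening, the pooled covariance estimate, and the midpoint-threshold decision rule are all assumptions made for illustration.

```python
import numpy as np

def screened_whitened_lda(X0, X1, s, eps=1e-8):
    """Hypothetical sketch: whiten the data, screen the s largest-magnitude
    coordinates of the whitened mean difference zeta, then apply Fisher's
    linear discriminant on the selected coordinates."""
    # Pooled covariance estimate (an assumption; the paper's estimator may differ)
    X = np.vstack([X0, X1])
    Sigma = np.cov(X, rowvar=False)
    # Whitening transform Sigma^{-1/2} via symmetric eigendecomposition
    vals, vecs = np.linalg.eigh(Sigma)
    W = vecs @ np.diag(1.0 / np.sqrt(np.maximum(vals, eps))) @ vecs.T
    Z0, Z1 = X0 @ W, X1 @ W
    # zeta: difference of whitened class means
    zeta = Z1.mean(axis=0) - Z0.mean(axis=0)
    # Screening: keep the s largest coordinates of zeta in magnitude
    keep = np.argsort(np.abs(zeta))[-s:]
    # In whitened coordinates the covariance is (approximately) the identity,
    # so the Fisher direction restricted to the kept features is zeta itself
    w = np.zeros(zeta.shape[0])
    w[keep] = zeta[keep]
    mid = 0.5 * (Z0.mean(axis=0) + Z1.mean(axis=0))
    def predict(Xnew):
        # Classify by the side of the midpoint hyperplane along w
        return ((Xnew @ W - mid) @ w > 0).astype(int)
    return predict
```

In the whitened space the within-class covariance is close to the identity, which is why the screened $\zeta$ doubles as the discriminant direction; screening before applying the discriminant is what lets the classifier operate in an $s$-dimensional representation.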