The support vector machine (SVM) is a popular classifier known for its accuracy, flexibility, and robustness. However, its heavy computational cost has hindered its application to large-scale datasets. In this paper, we propose a new optimal leverage classifier based on the linear SVM in the nonseparable setting. Our classifier selects an informative subset of the training sample to reduce the data size, enabling efficient computation while maintaining high accuracy. We take a novel view of the SVM under a general subsampling framework and rigorously investigate its statistical properties. We propose a two-step subsampling procedure consisting of a pilot step that estimates the optimal subsampling probabilities and a subsampling step that constructs the classifier. We develop a new Bahadur representation of the SVM coefficients and derive the unconditional asymptotic distribution and the optimal subsampling probabilities without conditioning on the full sample. Numerical results demonstrate that our classifiers outperform existing methods in terms of estimation, computation, and prediction.
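The two-step procedure described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state the form of the optimal subsampling probabilities, so the sketch assumes probabilities proportional to the norm of the hinge-loss subgradient at the pilot estimate, mixed with a uniform component so that every point keeps a positive probability. The function names `fit_linear_svm` and `two_step_subsample_svm`, the parameters `r_pilot`, `r`, and `mix`, and the plain subgradient solver are all hypothetical choices made for illustration.

```python
import numpy as np


def fit_linear_svm(X, y, weights=None, lam=1e-3, n_iter=500, lr=0.1):
    """Weighted linear SVM fitted by plain subgradient descent.

    Minimizes sum_i w_i * max(0, 1 - y_i * (b0 + x_i'b)) + (lam / 2) * ||b||^2,
    with the weights w_i normalized to sum to one. Labels must be in {-1, +1}.
    """
    n, d = X.shape
    w = np.full(n, 1.0 / n) if weights is None else weights / weights.sum()
    Xb = np.hstack([np.ones((n, 1)), X])  # prepend an intercept column
    beta = np.zeros(d + 1)
    for t in range(1, n_iter + 1):
        active = y * (Xb @ beta) < 1  # points with nonzero hinge loss
        grad = -(w[active] * y[active]) @ Xb[active]
        grad[1:] += lam * beta[1:]  # the intercept is not penalized
        beta -= lr / np.sqrt(t) * grad  # diminishing step size
    return beta


def two_step_subsample_svm(X, y, r_pilot, r, mix=0.1, seed=0):
    """Two-step subsampling sketch: pilot estimate, then weighted subsample fit."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]

    # Step 1: pilot estimate from a uniform subsample of size r_pilot.
    idx0 = rng.choice(n, size=r_pilot, replace=True)
    beta0 = fit_linear_svm(X[idx0], y[idx0])

    # ASSUMPTION: probabilities proportional to the hinge subgradient norm
    # at the pilot estimate. This stands in for the paper's optimal
    # probabilities, which the abstract does not specify.
    Xb = np.hstack([np.ones((n, 1)), X])
    score = (y * (Xb @ beta0) < 1) * np.linalg.norm(Xb, axis=1)
    if score.sum() == 0:
        probs = np.full(n, 1.0 / n)
    else:
        probs = (1 - mix) * score / score.sum() + mix / n

    # Step 2: draw r points with these probabilities and fit the SVM with
    # inverse-probability weights to correct for the unequal sampling.
    idx = rng.choice(n, size=r, replace=True, p=probs)
    beta = fit_linear_svm(X[idx], y[idx], weights=1.0 / probs[idx])
    return beta, probs
```

The returned `beta` holds the intercept in its first entry and the slope coefficients in the rest, so a new point `x` would be classified by the sign of `beta[0] + x @ beta[1:]`. The uniform mixing weight `mix` is a common safeguard against extreme inverse-probability weights; whether the paper uses a similar device is not stated in the abstract.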