When data are extremely large or physically stored in different locations, the distributed nearest neighbor (NN) classifier is an attractive tool for classification. We propose a novel distributed adaptive NN classifier in which the number of nearest neighbors is a tuning parameter chosen stochastically by a data-driven criterion. An early stopping rule is proposed for the search for the optimal tuning parameter; it not only speeds up the computation but also improves the finite-sample performance of the proposed algorithm. The convergence rate of the excess risk of the distributed adaptive NN classifier is investigated under various sub-sample size compositions. In particular, we show that when the sub-sample sizes are sufficiently large, the proposed classifier achieves a nearly optimal convergence rate. The effectiveness of the proposed approach is demonstrated through simulation studies as well as an empirical application to a real-world dataset.
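To make the overall idea concrete, the following is a minimal illustrative sketch and not the algorithm proposed in the paper, whose selection criterion and stopping rule are not specified in this abstract. It assumes each machine holds one sub-sample, picks its own number of neighbors on a doubling grid by hold-out error (a stand-in for the data-driven criterion), stops the search after a fixed number of non-improving steps (a stand-in for the early stopping rule), and combines local predictions by majority vote. All function names, the `patience` parameter, the 20% hold-out split, and the toy data are hypothetical.

```python
import random
from collections import Counter


def knn_predict(train, x, k):
    """Majority label among the k training points closest to x (squared Euclidean)."""
    nearest = sorted(
        train, key=lambda p: sum((a - b) ** 2 for a, b in zip(p[0], x))
    )[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def choose_k(fit, val, patience=2):
    """Pick k on a doubling grid by hold-out error (assumed criterion).

    The search stops once `patience` consecutive grid points fail to improve
    the error, which is an assumed form of early stopping.
    """
    best_k, best_err, stalls, k = 1, float("inf"), 0, 1
    while k <= len(fit) and stalls < patience:
        err = sum(knn_predict(fit, x, k) != y for x, y in val) / len(val)
        if err < best_err:
            best_k, best_err, stalls = k, err, 0
        else:
            stalls += 1
        k *= 2
    return best_k


def fit_distributed(data, n_machines, seed=0):
    """Shuffle, split the data into one shard per machine, and tune k locally."""
    rng = random.Random(seed)
    data = data[:]
    rng.shuffle(data)
    shards = [data[i::n_machines] for i in range(n_machines)]
    models = []
    for shard in shards:
        cut = max(1, len(shard) // 5)  # hold out roughly 20% for tuning
        val, fit = shard[:cut], shard[cut:]
        models.append((shard, choose_k(fit, val)))
    return models


def predict_distributed(models, x):
    """Each machine votes with its own k; the final label is the majority vote."""
    votes = [knn_predict(shard, x, k) for shard, k in models]
    return Counter(votes).most_common(1)[0][0]


# Toy data: two well-separated Gaussian clusters labeled 0 and 1.
rng = random.Random(1)
data = [
    ((rng.gauss(c, 1.0), rng.gauss(c, 1.0)), label)
    for label, c in ((0, -3.0), (1, 3.0))
    for _ in range(200)
]
models = fit_distributed(data, n_machines=4)
print([k for _, k in models], predict_distributed(models, (2.5, 3.5)))
```

In this sketch only the local votes would need to be communicated, so the raw data never leave their machine. Whether the paper aggregates by voting or by another rule cannot be determined from the abstract.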