We present a novel data-driven strategy for choosing the hyperparameter $k$ in the $k$-NN regression estimator without using any hold-out data. We treat the choice of the hyperparameter as an iterative procedure (over $k$) and propose a strategy, based on the idea of early stopping and the minimum discrepancy principle, that is easy to implement in practice. This model selection strategy is proven to be minimax-optimal over some smoothness function classes, for instance, the class of Lipschitz functions on a bounded domain. The novel method often improves statistical performance on artificial and real-world data sets in comparison to other model selection strategies, such as the hold-out method, 5-fold cross-validation, and the AIC criterion. The novelty of the strategy lies in reducing the computational time of the model selection procedure while preserving the statistical (minimax) optimality of the resulting estimator. More precisely, given a sample of size $n$, if one has to choose $k$ among $\left\{ 1, \ldots, n \right\}$ and $\left\{ f^1, \ldots, f^n \right\}$ are the corresponding estimators of the regression function, the minimum discrepancy principle requires computing only a fraction of these estimators, whereas this is not the case for generalized cross-validation, Akaike's AIC criterion, or the Lepskii principle.
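The early-stopping idea behind the strategy can be illustrated with a short sketch. The following Python code is a minimal illustration, not the paper's implementation: it iterates over $k$, fits the $k$-NN estimator on the sample, and stops at the first $k$ whose empirical residual reaches the noise level. The function name `mdp_knn_select` and the assumption that the noise variance `sigma2` is known (in practice it would have to be estimated) are ours.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

def mdp_knn_select(X, y, sigma2, k_max=None):
    """Illustrative early-stopping rule in the spirit of the minimum
    discrepancy principle: iterate over k = 1, 2, ... and return the first
    k whose empirical (training) residual reaches the noise level sigma2.
    Only the estimators f^1, ..., f^k up to the stopping time are computed."""
    n = len(y)
    k_max = n if k_max is None else k_max
    for k in range(1, k_max + 1):
        # k-NN estimator f^k evaluated at the sample points
        f_k = KNeighborsRegressor(n_neighbors=k).fit(X, y).predict(X)
        residual = np.mean((y - f_k) ** 2)  # empirical risk of f^k
        if residual >= sigma2:              # discrepancy crosses the noise level
            return k
    return k_max

# Hypothetical usage on synthetic data with known noise variance 0.3**2.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 1))
y = np.sin(2 * np.pi * X[:, 0]) + 0.3 * rng.normal(size=200)
k_hat = mdp_knn_select(X, y, sigma2=0.3 ** 2)
```

Because the empirical residual of the $k$-NN estimator increases with $k$ (at $k=1$ it is zero, since every point is its own nearest neighbor), the loop terminates early, which is why only a fraction of the $n$ candidate estimators ever needs to be fitted.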