We present a novel data-driven strategy for choosing the hyperparameter $k$ in the $k$-NN regression estimator without using any hold-out data. We treat the choice of the hyperparameter as an iterative procedure (over $k$) and propose a strategy, easily implemented in practice, based on early stopping and the minimum discrepancy principle. This model selection strategy is proven to be minimax-optimal, under a fixed-design assumption on the covariates, over several smoothness function classes, for instance the class of Lipschitz functions on a bounded domain. The novel method often improves statistical performance on artificial and real-world data sets in comparison with other model selection strategies, such as the hold-out method, 5-fold cross-validation, and the AIC criterion. The novelty of the strategy lies in reducing the computational time of the model selection procedure while preserving the statistical (minimax) optimality of the resulting estimator. More precisely, given a sample of size $n$, if one has to choose $k$ among $\left\{ 1, \ldots, n \right\}$ and $\left\{ f^1, \ldots, f^n \right\}$ are the corresponding estimators of the regression function, the minimum discrepancy principle requires computing only a fraction of these estimators, whereas generalized cross-validation, Akaike's AIC criterion, and the Lepskii principle require all of them.
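The early-stopping rule described above can be sketched in a few lines. This is a minimal illustration under simplifying assumptions (one-dimensional fixed-design covariates, known noise variance `sigma2`); the function names `knn_fit` and `mdp_choose_k` are ours, not the paper's. Since the $k$-NN estimator smooths more as $k$ grows, the empirical risk $\frac{1}{n}\sum_i (Y_i - f^k(x_i))^2$ increases with $k$, and the discrepancy principle stops at the first $k$ where it reaches the noise level, so only the estimators $f^1, \ldots, f^{\hat{k}}$ are ever computed.

```python
import numpy as np

def knn_fit(X, y, k):
    """Fixed-design k-NN regression: average the responses of the
    k nearest covariates (1-D covariates for simplicity)."""
    dist = np.abs(X[:, None] - X[None, :])
    idx = np.argsort(dist, axis=1)[:, :k]  # indices of k nearest neighbors
    return y[idx].mean(axis=1)

def mdp_choose_k(X, y, sigma2):
    """Minimum discrepancy principle (sketch): increase k and stop at the
    first k whose empirical risk reaches the noise level sigma2.
    Only a fraction of the n candidate estimators is computed."""
    n = len(y)
    for k in range(1, n + 1):
        f_k = knn_fit(X, y, k)
        if np.mean((y - f_k) ** 2) >= sigma2:
            return k, f_k
    return n, knn_fit(X, y, n)
```

For contrast, cross-validation or AIC would evaluate all $n$ candidate values of $k$ before selecting one; here the loop terminates as soon as the residual crosses the threshold.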