We present a novel data-driven strategy for choosing the hyperparameter $k$ in the $k$-NN regression estimator without using any hold-out data. We treat the choice of the hyperparameter as an iterative procedure (over $k$) and propose an easy-to-implement strategy based on early stopping and the minimum discrepancy principle. Under the fixed-design assumption on the covariates, this model selection strategy is proven to be minimax-optimal over some classes of smooth functions, for instance the class of Lipschitz functions on a bounded domain. On artificial and real-world data sets, the method often improves statistical performance compared with other model selection strategies, such as the hold-out method, 5-fold cross-validation, and the AIC criterion. The novelty of the strategy lies in reducing the computational time of the model selection procedure while preserving the statistical (minimax) optimality of the resulting estimator. More precisely, given a sample of size $n$, if $k$ is to be chosen among $\left\{ 1, \ldots, n \right\}$ and $\left\{ f^1, \ldots, f^n \right\}$ are the corresponding estimators of the regression function, the minimum discrepancy principle requires computing only a fraction of these estimators, which is not the case for generalized cross-validation, Akaike's AIC, or Lepskii's principle.
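To illustrate why only a fraction of the estimators needs to be computed, here is a minimal Python sketch of an early-stopping rule in the spirit of the minimum discrepancy principle. It rests on assumptions that are not stated in the abstract: the noise variance `sigma2` is taken as known (in practice it would have to be estimated), the estimators are evaluated at the design points only, and the rule stops at the first $k$ whose empirical risk exceeds `sigma2`. The function name `mdp_choose_k` and its return values are hypothetical, and the exact stopping rule analyzed in the paper may differ in its details.

```python
import numpy as np


def mdp_choose_k(x, y, sigma2):
    """Increase k from 1 and stop at the first k whose empirical risk
    (mean squared residual at the design points) exceeds sigma2.

    Assumption: sigma2 is a known or pre-estimated noise variance.
    Returns (k, fitted values of the k-NN estimator, list of risks computed).
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if x.ndim == 1:
        x = x[:, None]
    n = len(y)

    # Neighbour ordering for every design point; each point is its own
    # nearest neighbour (distance 0), so k = 1 interpolates the data.
    dist = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=2)
    order = np.argsort(dist, axis=1, kind="stable")

    running_sum = np.zeros(n)
    risks = []
    fitted = y.copy()
    for k in range(1, n + 1):
        # Update the k-NN averages incrementally by adding the k-th neighbour.
        running_sum += y[order[:, k - 1]]
        fitted = running_sum / k
        risk = float(np.mean((y - fitted) ** 2))
        risks.append(risk)
        if risk > sigma2:
            # Early stop: estimators for larger k are never computed.
            return k, fitted, risks
    return n, fitted, risks


# Illustrative synthetic data (fixed design on [0, 1], Gaussian noise).
rng = np.random.default_rng(0)
x_demo = np.linspace(0.0, 1.0, 200)
y_demo = np.sin(2 * np.pi * x_demo) + 0.3 * rng.standard_normal(200)
k_demo, fitted_demo, risks_demo = mdp_choose_k(x_demo, y_demo, sigma2=0.3 ** 2)
print(f"stopped at k = {k_demo} after computing {len(risks_demo)} of 200 estimators")
```

In this sketch the loop ends as soon as the threshold is crossed, so the number of estimators computed equals the selected $k$ rather than $n$. A criterion that must compare all of $f^1, \ldots, f^n$ before selecting would need every one of them.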