Gaussian process ($GP$) regression is a widely used non-parametric modeling tool, but its computational cost, which grows cubically with the training-set size, limits its use on massive data sets. A practical remedy is to predict using only the nearest neighbours of each test point, as in Nearest Neighbour Gaussian Process ($NNGP$) regression for geospatial problems and the related scalable $GPnn$ method for more general machine-learning applications. Despite the strong empirical performance of these methods, the large-$n$ theory of $NNGP/GPnn$ remains incomplete. We develop a theoretical framework for $NNGP$ and $GPnn$ regression. Under mild regularity assumptions, we derive almost sure pointwise limits for three key predictive criteria: mean squared error ($MSE$), calibration coefficient ($CAL$), and negative log-likelihood ($NLL$). We then study the $L_2$-risk, prove universal consistency, and show that the risk attains Stone's minimax rate $n^{-2\alpha/(2p+d)}$, where $\alpha$ and $p$ capture the regularity of the regression problem. We also prove uniform convergence of $MSE$ over compact hyper-parameter sets and show that its derivatives with respect to lengthscale, kernel scale, and noise variance vanish asymptotically, with explicit rates. This explains the observed robustness of $GPnn$ to hyper-parameter tuning. These results provide a rigorous statistical foundation for $NNGP/GPnn$ as a highly scalable and principled alternative to full $GP$ models.
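The nearest-neighbour prediction step described above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a squared-exponential kernel, a zero prior mean, brute-force neighbour search, and commonly used forms of the three criteria ($MSE$ as the mean squared residual, $CAL$ as the mean squared residual divided by the predictive variance, $NLL$ as the mean Gaussian negative log-density). The names `rbf_kernel`, `gpnn_predict`, `predictive_criteria` and the neighbour count `m` are hypothetical and chosen only for illustration.

```python
import numpy as np


def rbf_kernel(a, b, lengthscale, kernel_scale):
    """Squared-exponential kernel matrix between the rows of a and b (assumed kernel)."""
    sq_dists = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return kernel_scale * np.exp(-0.5 * sq_dists / lengthscale**2)


def gpnn_predict(x_train, y_train, x_test, m=50,
                 lengthscale=1.0, kernel_scale=1.0, noise_var=0.1):
    """Predict each test point from a GP fitted to its m nearest training points only."""
    m = min(m, len(x_train))
    means = np.empty(len(x_test))
    variances = np.empty(len(x_test))
    for i, x_star in enumerate(x_test):
        # Brute-force search for the m nearest neighbours (unordered).
        dists = np.linalg.norm(x_train - x_star, axis=1)
        idx = np.argpartition(dists, m - 1)[:m]
        x_nn, y_nn = x_train[idx], y_train[idx]

        # Local m x m system: cost is O(m^3) per test point instead of O(n^3).
        k_nn = rbf_kernel(x_nn, x_nn, lengthscale, kernel_scale) + noise_var * np.eye(m)
        k_star = rbf_kernel(x_nn, x_star[None, :], lengthscale, kernel_scale)[:, 0]

        means[i] = k_star @ np.linalg.solve(k_nn, y_nn)
        # Predictive variance of a noisy response at x_star.
        variances[i] = kernel_scale - k_star @ np.linalg.solve(k_nn, k_star) + noise_var
    return means, variances


def predictive_criteria(y_test, means, variances):
    """Return (MSE, CAL, NLL) under the assumed definitions stated above."""
    sq_err = (y_test - means) ** 2
    mse = sq_err.mean()
    cal = (sq_err / variances).mean()
    nll = 0.5 * (np.log(2.0 * np.pi * variances) + sq_err / variances).mean()
    return mse, cal, nll


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.uniform(size=(1000, 2))
    y = np.sin(3 * x[:, 0]) + 0.1 * rng.standard_normal(1000)
    mu, var = gpnn_predict(x[:900], y[:900], x[900:], m=30, noise_var=0.01)
    print(predictive_criteria(y[900:], mu, var))
```

Re-running the demo with different values of `lengthscale`, `kernel_scale`, and `noise_var` gives an informal way to look at how sensitive the $MSE$ is to the hyper-parameters on a given data set; it does not reproduce any result from the paper.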