In the Big Data era, and with the ubiquity of geolocation sensors in particular, massive datasets exhibiting a possibly complex spatial dependence structure are becoming increasingly available. In this context, the standard probabilistic theory of statistical learning does not apply directly, and guarantees on the generalization capacity of predictive rules learned from such data remain to be established. We analyze here the simple Kriging task from a statistical learning perspective, i.e. by carrying out a nonparametric finite-sample predictive analysis. Given $d\geq 1$ values taken by a realization of a square integrable random field $X=\{X_s\}_{s\in S}$, $S\subset \mathbb{R}^2$, with unknown covariance structure, at sites $s_1,\; \ldots,\; s_d$ in $S$, the goal is to predict the unknown values it takes at any other location $s\in S$ with minimum quadratic risk. The prediction rule is derived from a training spatial dataset: a single realization $X'$ of $X$, independent of those to be predicted, observed at $n\geq 1$ locations $\sigma_1,\; \ldots,\; \sigma_n$ in $S$. Despite the connection of this minimization problem with kernel ridge regression, establishing the generalization capacity of empirical risk minimizers is far from straightforward, because the training data $X'_{\sigma_1},\; \ldots,\; X'_{\sigma_n}$ involved in the learning procedure are not independent and identically distributed. In this article, non-asymptotic bounds of order $O_{\mathbb{P}}(1/\sqrt{n})$ are proved for the excess risk of a plug-in predictive rule mimicking the true minimizer in the case of isotropic stationary Gaussian processes observed at locations forming a regular grid in the learning stage. These theoretical results are illustrated by various numerical experiments, on simulated data and on real-world datasets.
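To make the prediction task concrete, the following is a minimal sketch of a simple Kriging predictor, not the procedure analyzed in the article. It assumes a centered (zero-mean) field and a known isotropic covariance function. The exponential covariance `exp_cov`, its parameter `theta`, and the function name `simple_kriging_predict` are hypothetical choices made for illustration. In the plug-in setting described above, the covariance would instead be estimated from the training realization $X'$ and substituted for `cov_fn`.

```python
import numpy as np


def exp_cov(h, theta=1.0):
    """Hypothetical isotropic covariance c(h) = exp(-h / theta) (illustrative assumption)."""
    return np.exp(-np.asarray(h, dtype=float) / theta)


def simple_kriging_predict(sites, values, target, cov_fn):
    """Predict the field at `target` from `values` observed at `sites`.

    Assumes a zero-mean field whose covariance depends only on distance.
    The predictor is c(s)^T Sigma^{-1} x, the linear rule with minimum
    quadratic risk when the covariance is known.
    """
    sites = np.asarray(sites, dtype=float)    # shape (d, 2)
    values = np.asarray(values, dtype=float)  # shape (d,)
    target = np.asarray(target, dtype=float)  # shape (2,)

    # Covariance matrix between the observed sites s_1, ..., s_d.
    pairwise = np.linalg.norm(sites[:, None, :] - sites[None, :, :], axis=-1)
    sigma = cov_fn(pairwise)

    # Covariance vector between the target location and each observed site.
    c = cov_fn(np.linalg.norm(sites - target, axis=-1))

    # Kriging weights solve Sigma w = c; the prediction is w^T x.
    weights = np.linalg.solve(sigma, c)
    return float(weights @ values)


if __name__ == "__main__":
    sites = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
    values = [1.0, -0.5, 2.0]
    print(simple_kriging_predict(sites, values, [0.5, 0.5], exp_cov))
```

With a known covariance this rule interpolates exactly at the observed sites and shrinks toward the zero mean far away from them. The difficulty addressed in the abstract is that the covariance is unknown and must be learned from a single, spatially dependent realization.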