In the Big Data era, with the ubiquity of geolocation sensors in particular, massive datasets exhibiting a possibly complex spatial dependence structure are becoming increasingly available. In this context, the standard probabilistic theory of statistical learning does not apply directly, and guarantees on the generalization capacity of predictive rules learned from such data remain to be established. We analyze here the simple Kriging task from a statistical learning perspective, i.e., by carrying out a nonparametric finite-sample predictive analysis. Given $d\geq 1$ values taken by a realization of a square-integrable random field $X=\{X_s\}_{s\in S}$, $S\subset \mathbb{R}^2$, with unknown covariance structure, at sites $s_1,\; \ldots,\; s_d$ in $S$, the goal is to predict the unknown value it takes at any other location $s\in S$ with minimum quadratic risk. The prediction rule is derived from a training spatial dataset: a single realization $X'$ of $X$, independent of those to be predicted, observed at $n\geq 1$ locations $\sigma_1,\; \ldots,\; \sigma_n$ in $S$. Despite the connection of this minimization problem with kernel ridge regression, establishing the generalization capacity of empirical risk minimizers is far from straightforward, because the training data $X'_{\sigma_1},\; \ldots,\; X'_{\sigma_n}$ involved in the learning procedure are not independent and identically distributed. In this article, non-asymptotic bounds of order $O_{\mathbb{P}}(1/\sqrt{n})$ are proved for the excess risk of a plug-in predictive rule mimicking the true minimizer in the case of isotropic stationary Gaussian processes observed at locations forming a regular grid in the learning stage. These theoretical results are illustrated by various numerical experiments on simulated data and on real-world datasets.
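To make the prediction task concrete, the following is a minimal sketch of the simple Kriging predictor at a location $s$, assuming a zero-mean field and an isotropic covariance function that is known. The function names, the exponential covariance `exp_cov`, and its parameters are illustrative assumptions and are not taken from the article; in the plug-in setting studied here, the known covariance would be replaced by an estimate computed from the training realization $X'$.

```python
import numpy as np


def simple_kriging_predict(cov_fn, sites, values, s):
    """Simple Kriging prediction at location s for a zero-mean field.

    cov_fn : callable mapping distances to covariances (isotropy assumed)
    sites  : (d, 2) array of observed locations s_1, ..., s_d
    values : (d,) array of observed values X_{s_1}, ..., X_{s_d}
    s      : (2,) array, location where the field is predicted
    """
    sites = np.asarray(sites, dtype=float)
    values = np.asarray(values, dtype=float)
    s = np.asarray(s, dtype=float)

    # Covariance matrix between the observed sites (d x d).
    pairwise = np.linalg.norm(sites[:, None, :] - sites[None, :, :], axis=-1)
    sigma = cov_fn(pairwise)

    # Covariance vector between the target location and the observed sites.
    c = cov_fn(np.linalg.norm(sites - s, axis=-1))

    # Kriging weights solve sigma @ w = c; the prediction is w^T values.
    weights = np.linalg.solve(sigma, c)
    return float(weights @ values)


def exp_cov(h, variance=1.0, length_scale=0.5):
    """Hypothetical isotropic exponential covariance, used for illustration only."""
    return variance * np.exp(-np.asarray(h, dtype=float) / length_scale)


# Toy example with d = 3 observed sites in the plane (made-up values).
sites = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
values = np.array([0.3, -1.2, 0.8])
pred = simple_kriging_predict(exp_cov, sites, values, np.array([0.5, 0.5]))
print(pred)
```

With an invertible covariance matrix, this predictor reproduces the observed value at an observed site and shrinks toward the zero mean far from all sites. A plug-in rule would pass an estimated covariance as `cov_fn` instead of `exp_cov`.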