The accurate predictions and principled uncertainty measures provided by GP regression incur an O(n^3) cost, which is prohibitive for modern large-scale applications. This has motivated extensive work on computationally efficient approximations. We introduce a new perspective by exploring the robustness properties and limiting behaviour of GP nearest-neighbour (GPnn) prediction. We demonstrate through theory and simulation that, as the data size n increases, the accuracy of the estimated parameters and of the GP model assumptions becomes increasingly irrelevant to GPnn predictive accuracy. Consequently, it is sufficient to spend only a small amount of work on parameter estimation in order to achieve high MSE accuracy, even in the presence of gross misspecification. In contrast, as n tends to infinity, uncertainty calibration and NLL are shown to remain sensitive to just one parameter, the additive noise variance. We show, however, that this source of inaccuracy can be corrected for, thereby achieving both well-calibrated uncertainty measures and accurate predictions at remarkably low computational cost. We exhibit a very simple GPnn regression algorithm with stand-out performance compared to other state-of-the-art GP approximations, as measured on large UCI datasets. It operates at a small fraction of those other methods' training costs, for example taking about 30 seconds on a basic laptop to train on a dataset of size n = 1.6 × 10^6.
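As a rough illustration of the nearest-neighbour prediction idea described above, the following sketch conditions a GP on only the m training points closest to each test input, so the cubic cost applies to m rather than n. It is a minimal sketch under stated assumptions and not the algorithm from the paper: the zero-mean prior, the squared-exponential kernel, the brute-force neighbour search, the default hyperparameter values, and the names `rbf_kernel` and `gpnn_predict` are all illustrative choices. It also omits the parameter estimation and noise-variance correction steps mentioned in the abstract.

```python
import numpy as np


def rbf_kernel(a, b, lengthscale, signal_var):
    """Squared-exponential kernel between the rows of a and the rows of b."""
    sq_dists = (
        np.sum(a**2, axis=1)[:, None]
        + np.sum(b**2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    # Clip tiny negative values caused by floating-point round-off.
    return signal_var * np.exp(-0.5 * np.maximum(sq_dists, 0.0) / lengthscale**2)


def gpnn_predict(x_train, y_train, x_test, m=50,
                 lengthscale=1.0, signal_var=1.0, noise_var=0.1):
    """Zero-mean GP prediction using only the m nearest neighbours of each test point.

    Returns the predictive mean and the predictive variance of the noisy
    response at each row of x_test. Hyperparameters are assumed to be given.
    """
    m = min(m, len(x_train))
    means = np.empty(len(x_test))
    variances = np.empty(len(x_test))
    for i, x_star in enumerate(x_test):
        # Brute-force search for the m nearest training points (O(n) per query).
        sq_dists = np.sum((x_train - x_star) ** 2, axis=1)
        idx = np.argpartition(sq_dists, m - 1)[:m]
        x_nn, y_nn = x_train[idx], y_train[idx]

        # Standard GP regression equations restricted to the neighbour set.
        k_nn = rbf_kernel(x_nn, x_nn, lengthscale, signal_var) + noise_var * np.eye(m)
        k_star = rbf_kernel(x_nn, x_star[None, :], lengthscale, signal_var)[:, 0]
        chol = np.linalg.cholesky(k_nn)
        alpha = np.linalg.solve(chol.T, np.linalg.solve(chol, y_nn))
        v = np.linalg.solve(chol, k_star)

        means[i] = k_star @ alpha
        variances[i] = signal_var + noise_var - v @ v
    return means, variances
```

Calling `gpnn_predict(x_train, y_train, x_test, m=30)` with two-dimensional input arrays of shape `(n, d)` returns one mean and one variance per test row. Each prediction costs O(m^3) for the Cholesky factorisation plus the neighbour search; for large n a spatial index or approximate nearest-neighbour library would likely replace the brute-force search used here.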