The accurate predictions and principled uncertainty measures provided by Gaussian process (GP) regression incur an O(n^3) cost, which is prohibitive for modern large-scale applications. This has motivated extensive work on computationally efficient approximations. We introduce a new perspective by exploring the robustness properties and limiting behaviour of GP nearest-neighbour (GPnn) prediction. We demonstrate through theory and simulation that, as the data size n increases, the accuracy of the estimated parameters and the validity of the GP model assumptions become increasingly irrelevant to GPnn predictive accuracy. Consequently, a small amount of work on parameter estimation suffices to achieve high accuracy in terms of mean squared error (MSE), even in the presence of gross misspecification. In contrast, as n tends to infinity, uncertainty calibration and negative log-likelihood (NLL) are shown to remain sensitive to just one parameter, the additive noise variance; we show, however, that this source of inaccuracy can be corrected for, thereby achieving both well-calibrated uncertainty measures and accurate predictions at remarkably low computational cost. We exhibit a very simple GPnn regression algorithm with stand-out performance compared to other state-of-the-art GP approximations, as measured on large UCI datasets. It operates at a small fraction of those other methods' training costs; for example, on a basic laptop it takes about 30 seconds to train on a dataset of size n = 1.6 × 10^6.
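To make the idea of nearest-neighbour GP prediction concrete, the following is a minimal sketch of a single prediction, not the algorithm evaluated in this work. It assumes a squared-exponential kernel, a zero prior mean, a brute-force neighbour search, and fixed hyperparameters; the names `rbf_kernel`, `gpnn_predict`, `m`, `lengthscale`, `signal_var` and `noise_var` are illustrative choices rather than anything specified above.

```python
import numpy as np


def rbf_kernel(A, B, lengthscale, signal_var):
    """Squared-exponential kernel between rows of A and rows of B (assumed kernel)."""
    sq_dists = (
        np.sum(A**2, axis=1)[:, None]
        + np.sum(B**2, axis=1)[None, :]
        - 2.0 * A @ B.T
    )
    # Clip tiny negative values caused by floating-point cancellation.
    return signal_var * np.exp(-0.5 * np.maximum(sq_dists, 0.0) / lengthscale**2)


def gpnn_predict(X_train, y_train, x_star, m=50,
                 lengthscale=1.0, signal_var=1.0, noise_var=0.1):
    """Predict at one test point x_star using only its m nearest training points.

    Returns the predictive mean and the predictive variance of a noisy
    observation at x_star, under a zero-mean GP prior.
    """
    m = min(m, len(X_train))

    # Brute-force search for the m nearest neighbours (Euclidean distance).
    sq_dist_to_star = np.sum((X_train - x_star) ** 2, axis=1)
    nn_idx = np.argpartition(sq_dist_to_star, m - 1)[:m]
    X_nn, y_nn = X_train[nn_idx], y_train[nn_idx]

    # Standard GP regression equations restricted to the neighbour set.
    K = rbf_kernel(X_nn, X_nn, lengthscale, signal_var) + noise_var * np.eye(m)
    k_star = rbf_kernel(X_nn, x_star[None, :], lengthscale, signal_var)[:, 0]

    L = np.linalg.cholesky(K)                 # K = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_nn))  # K^{-1} y
    v = np.linalg.solve(L, k_star)            # so that v @ v = k*^T K^{-1} k*

    mean = k_star @ alpha
    var = signal_var + noise_var - v @ v
    return mean, var
```

In this sketch each prediction solves an m × m linear system, so the cost per test point is O(m^3) plus the neighbour search, rather than the O(n^3) of exact GP regression. The brute-force search is used only to keep the example self-contained; at large n a spatial index or approximate nearest-neighbour method would normally replace it.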