Gaussian process (GP) hyperparameter optimization requires repeatedly solving linear systems with $n \times n$ kernel matrices. To address the prohibitive $\mathcal{O}(n^3)$ time complexity, recent work has employed fast iterative numerical methods such as conjugate gradients (CG). However, as datasets grow, the corresponding kernel matrices become increasingly ill-conditioned and still require $\mathcal{O}(n^2)$ space without partitioning. Thus, while CG increases the size of datasets on which GPs can be trained, modern datasets reach scales beyond its applicability. In this work, we propose an iterative method that accesses only subblocks of the kernel matrix, effectively enabling \emph{mini-batching}. Our algorithm, based on alternating projection, has $\mathcal{O}(n)$ per-iteration time and space complexity, solving many of the practical challenges of scaling GPs to very large datasets. Theoretically, we prove that our method enjoys linear convergence; empirically, we demonstrate its robustness to ill-conditioning. On large-scale benchmark datasets with up to four million data points, our approach accelerates training by a factor of 2$\times$ to 27$\times$ compared to CG.
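To make the mini-batching idea concrete, the following is a minimal sketch of a block-wise solver for $(K + \sigma^2 I)\,w = y$ that touches only an $n \times b$ slice of the kernel matrix per step. It is an illustration under stated assumptions, not the authors' implementation: the RBF kernel, the cyclic block order, the fixed noise variance, and the names `rbf_kernel` and `alternating_projection_solve` are all hypothetical choices that the abstract does not specify.

```python
import numpy as np


def rbf_kernel(X1, X2, lengthscale=1.0):
    """RBF kernel block between the rows of X1 and the rows of X2 (assumed kernel)."""
    sq = (X1**2).sum(1)[:, None] + (X2**2).sum(1)[None, :] - 2.0 * X1 @ X2.T
    return np.exp(-0.5 * np.maximum(sq, 0.0) / lengthscale**2)


def alternating_projection_solve(X, y, noise=1.0, batch_size=64,
                                 max_epochs=1000, tol=1e-8):
    """Approximately solve (K + noise * I) w = y using only n x b kernel slices.

    Each step picks a block of indices, solves the small b x b system that
    zeroes the residual on that block, and updates the full residual.
    Blocks are visited cyclically here; that ordering is an assumption.
    """
    n = X.shape[0]
    w = np.zeros(n)
    r = y.astype(float).copy()  # residual y - (K + noise * I) w, with w = 0
    blocks = [np.arange(s, min(s + batch_size, n))
              for s in range(0, n, batch_size)]
    y_norm = np.linalg.norm(y)
    for _ in range(max_epochs):
        for idx in blocks:
            # Only these n x b columns of the kernel matrix are formed.
            cols = rbf_kernel(X, X[idx])
            cols[idx, np.arange(len(idx))] += noise  # diagonal noise term
            # Small b x b solve that cancels the residual on this block.
            d = np.linalg.solve(cols[idx], r[idx])
            w[idx] += d
            r -= cols @ d
        if np.linalg.norm(r) <= tol * y_norm:
            break
    return w
```

With a fixed batch size $b$, each step in this sketch costs $\mathcal{O}(nb)$ time and memory for the kernel slice plus $\mathcal{O}(b^3)$ for the small solve, so the full $n \times n$ matrix is never materialized. Calling `alternating_projection_solve(X, y)` on a small synthetic dataset and comparing against a dense `np.linalg.solve` is a quick way to sanity-check the result.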