Gaussian processes (GPs) are a widely used regression tool, but the cubic complexity of exact solvers limits their scalability. To address this challenge, we extend the GPRat library with a fully GPU-resident GP prediction pipeline. GPRat is an HPX-based library that combines task-based parallelism with an intuitive Python API. We implement tiled algorithms for GP prediction on top of optimized CUDA libraries, thereby exploiting massive parallelism in the underlying linear algebra operations. We evaluate the optimal number of CUDA streams and compare the performance of our GPU implementation against the existing CPU-based implementation. Our results show that the GPU implementation provides speedups for datasets larger than 128 training samples, reaching up to 4.3× for the Cholesky decomposition itself and 4.6× for the full GP prediction. Furthermore, combining HPX with multiple CUDA streams allows GPRat to match, and for large datasets surpass, cuSOLVER's performance by up to 11 percent.
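To illustrate the tiled approach mentioned above, the following is a minimal NumPy sketch of a blocked, right-looking Cholesky factorization. The function name `tiled_cholesky` and the tile size are illustrative, not GPRat's API; in the GPU pipeline described in the paper, the per-tile operations (POTRF on the diagonal tile, TRSM on the panel, GEMM/SYRK on the trailing tiles) would instead be dispatched as tasks to cuSOLVER/cuBLAS kernels on CUDA streams.

```python
import numpy as np

def tiled_cholesky(A, tile):
    """Blocked (tiled) right-looking Cholesky factorization: A = L @ L.T.

    Illustrative CPU sketch only: each tile-level operation below
    corresponds to a kernel a GPU implementation would launch.
    Assumes A is symmetric positive definite and n is divisible by `tile`.
    """
    n = A.shape[0]
    nt = n // tile                      # number of tile rows/columns
    L = np.tril(A.astype(float).copy()) # work on the lower triangle
    for k in range(nt):
        ks = slice(k * tile, (k + 1) * tile)
        # POTRF: factorize the diagonal tile
        L[ks, ks] = np.linalg.cholesky(L[ks, ks])
        for i in range(k + 1, nt):
            is_ = slice(i * tile, (i + 1) * tile)
            # TRSM: triangular solve of the panel tile against the
            # freshly factorized diagonal tile
            L[is_, ks] = np.linalg.solve(L[ks, ks], L[is_, ks].T).T
        for i in range(k + 1, nt):
            is_ = slice(i * tile, (i + 1) * tile)
            for j in range(k + 1, i + 1):
                js = slice(j * tile, (j + 1) * tile)
                # SYRK/GEMM: rank-`tile` update of the trailing tiles
                L[is_, js] -= L[is_, ks] @ L[js, ks].T
    return L
```

Because the TRSM and update steps of one iteration only touch disjoint tiles, a task runtime such as HPX can execute them concurrently, which is what makes multiple CUDA streams pay off.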