Gaussian processes (GPs) are a widely used regression tool, but the cubic complexity of exact solvers limits their scalability. To address this challenge, we extend the GPRat library with a fully GPU-resident GP prediction pipeline. GPRat is an HPX-based library that combines task-based parallelism with an intuitive Python API. We implement tiled algorithms for GP prediction on top of optimized CUDA libraries, thereby exploiting massive parallelism in the underlying linear algebra operations. We evaluate the optimal number of CUDA streams and compare the performance of our GPU implementation against the existing CPU-based implementation. Our results show that the GPU implementation provides speedups for datasets larger than 128 training samples, reaching up to 4.3× for the Cholesky decomposition itself and 4.6× for the full GP prediction. Furthermore, combining HPX with multiple CUDA streams allows GPRat to match, and for large datasets surpass, cuSOLVER's performance by up to 11 percent.
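To illustrate the tiled approach mentioned above, the following is a minimal NumPy sketch of a blocked, right-looking Cholesky factorization. The function name `tiled_cholesky` and the tile size are illustrative, not GPRat's API; in the GPU pipeline described in the paper, the per-tile operations (POTRF on the diagonal tile, TRSM on the panel, GEMM/SYRK on the trailing tiles) would instead be dispatched as tasks to cuSOLVER/cuBLAS kernels on CUDA streams.

```python
import numpy as np

def tiled_cholesky(A, tile):
    """Blocked (tiled) right-looking Cholesky factorization: A = L @ L.T.

    Illustrative CPU sketch only: each tile-level operation below
    corresponds to a kernel a GPU implementation would launch.
    Assumes A is symmetric positive definite and n is divisible by `tile`.
    """
    n = A.shape[0]
    nt = n // tile                      # number of tile rows/columns
    L = np.tril(A.astype(float).copy()) # work on the lower triangle
    for k in range(nt):
        ks = slice(k * tile, (k + 1) * tile)
        # POTRF: factorize the diagonal tile
        L[ks, ks] = np.linalg.cholesky(L[ks, ks])
        for i in range(k + 1, nt):
            is_ = slice(i * tile, (i + 1) * tile)
            # TRSM: triangular solve of the panel tile against the
            # freshly factorized diagonal tile
            L[is_, ks] = np.linalg.solve(L[ks, ks], L[is_, ks].T).T
        for i in range(k + 1, nt):
            is_ = slice(i * tile, (i + 1) * tile)
            for j in range(k + 1, i + 1):
                js = slice(j * tile, (j + 1) * tile)
                # SYRK/GEMM: rank-`tile` update of the trailing tiles
                L[is_, js] -= L[is_, ks] @ L[js, ks].T
    return L
```

Because the TRSM and update steps of one iteration only touch disjoint tiles, a task runtime such as HPX can execute them concurrently, which is what makes multiple CUDA streams pay off.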