We demonstrate a high-performance, vendor-agnostic method for massively parallel solving of ensembles of ordinary differential equations (ODEs) and stochastic differential equations (SDEs) on GPUs. The method is integrated with a widely used differential equation solver library in a high-level language (Julia's DifferentialEquations.jl) and enables GPU acceleration without requiring code changes by the user. Our approach achieves state-of-the-art performance relative to hand-optimized CUDA-C++ kernels, while running 20--100$\times$ faster than the vectorizing map (vmap) approach implemented in JAX and PyTorch. Performance evaluation on NVIDIA, AMD, Intel, and Apple GPUs demonstrates performance portability and vendor-agnosticism. We also show composability with MPI to enable distributed multi-GPU workflows. The implemented solvers are fully featured, supporting event handling, automatic differentiation, and incorporation of datasets via the GPU's texture memory, so that scientists can take advantage of GPU acceleration on all major current architectures without changing their model code and without loss of performance. We distribute the software as an open-source library at https://github.com/SciML/DiffEqGPU.jl.