We present our experience porting optimized CUDA implementations to oneAPI. We focus on the use case of numerical integration, specifically the CUDA implementations of PAGANI and $m$-Cubes. We faced several challenges that caused performance degradation in the oneAPI ports: differences in the number of registers used per thread, differences in compiler optimizations, and the mapping of CUDA library calls to oneAPI equivalents. After addressing these challenges, we tested both the PAGANI and $m$-Cubes integrators on numerous integrands with varying characteristics. To evaluate the quality of the ports, we collected performance metrics for the CUDA and oneAPI implementations on the Nvidia V100 GPU. We found that the oneAPI ports often achieve performance comparable to the CUDA versions and are at most 10% slower.