As scientific codes are ported between GPU platforms, continuous testing is required to ensure numerical robustness and to identify numerical differences. A compiler-induced numerical difference occurs when the same program, compiled and run on different GPUs, produces different numerical results for the same input. We present a study of compiler-induced numerical differences between NVIDIA and AMD GPUs. Our approach uses Varity to generate thousands of short numerical tests in CUDA and HIP, together with their inputs; we then use differential testing to check whether each test produces inconsistent numerical results across these GPUs. We also use the HIPIFY tool to convert CUDA tests into HIP and check whether the conversion itself introduces numerical inconsistencies. In total, we generated more than 600,000 tests and found subtle numerical differences that stem from (1) math library calls, (2) differences in floating-point precision (FP64 versus FP32), and (3) converting code with HIPIFY.
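The differential-testing step can be illustrated with a minimal sketch. It assumes each generated test prints a single floating-point result on each platform and that the two results have already been collected into dictionaries keyed by test name. The function names (`classify`, `compare`, `find_inconsistencies`), the category labels, and the sample values are hypothetical and chosen only for illustration; they are not Varity's actual interface, comparison criteria, or measured results.

```python
import math
import struct


def bits(x: float) -> int:
    """Return the IEEE 754 binary64 bit pattern of x as an unsigned integer."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]


def classify(x: float) -> str:
    """Place a result into a coarse class: nan, inf, zero, or number."""
    if math.isnan(x):
        return "nan"
    if math.isinf(x):
        return "inf"
    if x == 0.0:
        return "zero"
    return "number"


def compare(a: float, b: float) -> str:
    """Compare the results of one test from two platforms (assumed criteria)."""
    ca, cb = classify(a), classify(b)
    if ca != cb:
        return f"class mismatch: {ca} vs {cb}"
    if ca == "nan":
        # Assumption: NaN payload and sign bits are ignored.
        return "consistent"
    if bits(a) != bits(b):
        # Same class but different bits, including sign-only differences.
        return "value mismatch"
    return "consistent"


def find_inconsistencies(results_a: dict, results_b: dict) -> dict:
    """Return {test name: verdict} for tests present on both platforms
    whose results are not consistent."""
    report = {}
    for name in sorted(results_a.keys() & results_b.keys()):
        verdict = compare(results_a[name], results_b[name])
        if verdict != "consistent":
            report[name] = verdict
    return report


if __name__ == "__main__":
    # Placeholder values for illustration only, not measured data.
    platform_a = {"test_1": 1.0, "test_2": float("inf"), "test_3": 0.1 + 0.2}
    platform_b = {"test_1": 1.0, "test_2": 1.0e308, "test_3": 0.3}
    for name, verdict in find_inconsistencies(platform_a, platform_b).items():
        print(name, verdict)
```

Comparing bit patterns rather than applying a tolerance flags every difference, however small, which suits a search for subtle inconsistencies; a real campaign might additionally record the magnitude of each difference in order to rank findings.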