Evaluation of OpenAI Codex for HPC Parallel Programming Models Kernel Generation

From arXiv. Accepted at the Sixteenth International Workshop on Parallel Programming Models and Systems Software for High-End Computing (P2S2), 2023, to be held in conjunction with ICPP 2023: The 52nd International Conference on Parallel Processing. 10 pages, 6 figures, 5 tables.

We evaluate AI-assisted generative capabilities on fundamental numerical kernels in high-performance computing (HPC), including AXPY, GEMV, GEMM, SpMV, Jacobi Stencil, and CG. We test the generated kernel codes for a variety of language-supported programming models, including (1) C++ (e.g., OpenMP [including offload], OpenACC, Kokkos, SYCL, CUDA, and HIP), (2) Fortran (e.g., OpenMP [including offload] and OpenACC), (3) Python (e.g., NumPy, Numba, CuPy, and PyCUDA), and (4) Julia (e.g., Threads, CUDA.jl, AMDGPU.jl, and KernelAbstractions.jl). We use the GitHub Copilot capabilities powered by OpenAI Codex available in Visual Studio Code as of April 2023 to generate a large number of implementations given simple <kernel> + <programming model> + <optional hints> prompt variants. To quantify and compare the results, we propose a proficiency metric based on the initial 10 suggestions given for each prompt. Results suggest that the OpenAI Codex outputs for C++ correlate with the adoption and maturity of programming models. For example, OpenMP and CUDA score highly, whereas HIP is still lacking. We found that prompts from either a targeted language such as Fortran or the more general-purpose Python can benefit from adding code keywords, while Julia prompts perform acceptably well for the language's mature programming models (e.g., Threads and CUDA.jl). We expect these benchmarks to provide a point of reference for each programming model's community. Overall, understanding the convergence of large language models, AI, and HPC is crucial because the field is evolving rapidly and is redefining human-computer interactions.
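To make the <kernel> + <programming model> + <optional hints> prompt pattern concrete, the following is a minimal sketch in Python. The abstract does not give the exact prompt wording, so the helper `build_prompt`, its parameters, and the example hint strings are hypothetical. The `axpy` function is a plain reference version of the simplest kernel listed (y = a*x + y), included only to show the kind of code a suggestion would be checked against; it is not the study's evaluation code.

```python
# Illustrative sketch only: `build_prompt` and the hint strings are assumptions,
# not the prompts used in the study.

def build_prompt(kernel, model, hint=None, comment="#"):
    """Compose a one-line comment prompt: <kernel> <programming model> <optional hint>."""
    parts = [kernel, model]
    if hint:
        parts.append(hint)
    return f"{comment} " + " ".join(parts)


def axpy(a, x, y):
    """Reference AXPY: return a*x + y element-wise for equal-length sequences."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    return [a * xi + yi for xi, yi in zip(x, y)]


if __name__ == "__main__":
    # Prompt without a hint, then with a language keyword as the optional hint.
    print(build_prompt("AXPY", "Numba"))
    print(build_prompt("AXPY", "Numba", hint="def"))
    # C++ style comment prefix for a C++ programming model.
    print(build_prompt("AXPY", "OpenMP", hint="function", comment="//"))
    # Reference result that a generated AXPY kernel could be compared against.
    print(axpy(2.0, [1.0, 2.0, 3.0], [10.0, 20.0, 30.0]))
```

In this sketch the optional hint stands in for the "code keywords" that the abstract reports as helpful for Fortran and Python prompts; the specific keywords shown are assumed for illustration.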
