Large language models are becoming an increasingly popular tool for software development. Their ability to model and generate source code has been demonstrated in a variety of contexts, including code completion, summarization, translation, and lookup. However, they often struggle to generate code for complex programs. In this paper, we study the capabilities of state-of-the-art language models to generate parallel code. To evaluate these models, we create a benchmark, ParEval, consisting of prompts that represent 420 different coding tasks related to scientific and parallel computing. We use ParEval to evaluate the effectiveness of several state-of-the-art open- and closed-source language models on these tasks. We introduce novel metrics for evaluating the performance of generated code, and use them to explore how well each large language model performs across 12 different computational problem types and six different parallel programming models.