Large language models (LLMs) are becoming an increasingly popular tool for software development. Their ability to model and generate source code has been demonstrated in a variety of contexts, including code completion, summarization, translation, and lookup. However, they often struggle to generate code for more complex tasks. In this paper, we explore the ability of state-of-the-art language models to generate parallel code. We propose a benchmark, PCGBench, consisting of 420 tasks for evaluating this ability, and we evaluate several state-of-the-art open- and closed-source language models on these tasks. We introduce novel metrics for comparing parallel code generation performance and use them to explore how well each LLM performs across parallel programming models and computational problem types.