MPIrigen: MPI Code Generation through Domain-Specific Language Models

Nadav Schneider, Niranjan Hasabnis, Vy A. Vo, Tal Kadosh, Neva Krien, Mihai Capotă, Guy Tamir, Ted Willke, Nesreen Ahmed, Yuval Pinter, Timothy Mattson, Gal Oren

The imperative need to scale computation across numerous nodes highlights the significance of efficient parallel computing, particularly in the realm of Message Passing Interface (MPI) integration. The challenging parallel programming task of generating MPI-based parallel programs has remained unexplored. This study first investigates the performance of state-of-the-art language models in generating MPI-based parallel programs. The findings reveal that widely used models such as GPT-3.5 and PolyCoder (specialized multi-lingual code models) exhibit notable performance degradation when generating MPI-based programs compared to general-purpose programs. In contrast, domain-specific models such as MonoCoder, which are pretrained on the MPI-related programming languages C and C++, outperform larger models. Subsequently, we introduce a dedicated downstream task of MPI-based program generation by fine-tuning MonoCoder on HPCorpusMPI. We call the resulting model MPIrigen. We propose an innovative preprocessing step in which completion is performed only after the whole code has been observed, enabling better completion with a wider context. Comparative analysis against GPT-3.5 zero-shot performance, using a novel HPC-oriented evaluation method, demonstrates that MPIrigen excels at generating accurate MPI functions, with up to 0.8 accuracy in location and function predictions and more than 0.9 accuracy in argument predictions. The success of this tailored solution underscores the importance of domain-specific fine-tuning in optimizing language models for parallel computing code generation, paving the way for a new generation of automatic parallelization tools. The sources of this work are available at our GitHub MPIrigen repository: https://github.com/Scientific-Computing-Lab-NRCN/MPI-rigen
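To make the three reported quantities (function, location, and argument accuracy) more concrete, the following Python sketch shows one way such scores could be computed for a generated C program. It is only an illustration under stated assumptions, not the paper's HPC-oriented evaluation method: the names `extract_mpi_calls` and `score_mpi_calls`, the regular expression, the one-line location tolerance, and the greedy matching by function name are all hypothetical choices made for this example.

```python
import re

# Illustrative pattern (an assumption, not from the paper): an MPI call that
# sits on one line and ends with a semicolon, e.g. MPI_Send(buf, 1, MPI_INT, ...);
MPI_CALL = re.compile(r"\b(MPI_\w+)\s*\(([^;]*)\)\s*;")


def extract_mpi_calls(source):
    """Return (line_number, function_name, arguments) for each MPI call found."""
    calls = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for match in MPI_CALL.finditer(line):
            # Naive comma split: nested calls inside arguments are not handled.
            args = tuple(a.strip() for a in match.group(2).split(",") if a.strip())
            calls.append((lineno, match.group(1), args))
    return calls


def score_mpi_calls(reference, predicted, line_tolerance=1):
    """Hypothetical scoring: compare predicted MPI calls against a reference."""
    ref_calls = extract_mpi_calls(reference)
    unused = extract_mpi_calls(predicted)
    func_hits = loc_hits = arg_hits = 0
    for ref_line, ref_name, ref_args in ref_calls:
        # Greedily pair each reference call with the first unused prediction
        # that has the same function name.
        match = next((p for p in unused if p[1] == ref_name), None)
        if match is None:
            continue
        unused.remove(match)
        func_hits += 1
        if abs(match[0] - ref_line) <= line_tolerance:
            loc_hits += 1
        if match[2] == ref_args:
            arg_hits += 1
    total = len(ref_calls)
    return {
        # Share of reference calls whose function name was predicted.
        "function": func_hits / total if total else 0.0,
        # Share of reference calls predicted near the right line.
        "location": loc_hits / total if total else 0.0,
        # Share of name-matched calls whose arguments agree exactly.
        "arguments": arg_hits / func_hits if func_hits else 0.0,
    }


if __name__ == "__main__":
    reference = (
        "MPI_Init(&argc, &argv);\n"
        "MPI_Comm_rank(MPI_COMM_WORLD, &rank);\n"
        "MPI_Finalize();"
    )
    predicted = (
        "MPI_Init(&argc, &argv);\n"
        "MPI_Comm_rank(MPI_COMM_WORLD, &size);"
    )
    print(score_mpi_calls(reference, predicted))
```

In this toy comparison the prediction omits `MPI_Finalize` and passes a different second argument to `MPI_Comm_rank`, so the function, location, and argument scores each fall below 1. A real evaluation would need a proper C parser in place of the regular expression, since calls that span several lines or contain nested calls are missed here.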
