MPIrigen: MPI Code Generation through Domain-Specific Language Models

Nadav Schneider, Niranjan Hasabnis, Vy A. Vo, Tal Kadosh, Neva Krien, Mihai Capotă, Abdul Wasay, Guy Tamir, Ted Willke, Nesreen Ahmed, Yuval Pinter, Timothy Mattson, Gal Oren

The imperative need to scale computation across numerous nodes highlights the significance of efficient parallel computing, particularly in the realm of Message Passing Interface (MPI) integration. The challenging parallel programming task of generating MPI-based parallel programs has remained unexplored. This study first investigates the performance of state-of-the-art language models in generating MPI-based parallel programs. Findings reveal that widely used models such as GPT-3.5 and PolyCoder (a specialized multilingual code model) exhibit notable performance degradation when generating MPI-based programs compared to general-purpose programs. In contrast, domain-specific models such as MonoCoder, which are pretrained on the MPI-related programming languages C and C++, outperform larger models. Subsequently, we introduce a dedicated downstream task of MPI-based program generation by fine-tuning MonoCoder on HPCorpusMPI. We call the resulting model MPIrigen. We propose an innovative preprocessing step that performs completion only after the whole code has been observed, thus enabling better completion with a wider context. Comparative analysis against GPT-3.5 zero-shot performance, using a novel HPC-oriented evaluation method, demonstrates that MPIrigen generates accurate MPI functions, reaching up to 0.8 accuracy in location and function predictions and more than 0.9 accuracy in argument predictions. The success of this tailored solution underscores the importance of domain-specific fine-tuning in optimizing language models for parallel computing code generation, paving the way for a new generation of automatic parallelization tools. The sources of this work are available at our GitHub MPIrigen repository: https://github.com/Scientific-Computing-Lab-NRCN/MPI-rigen
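
The abstract does not spell out the preprocessing, so the following is only a minimal sketch of one possible reading: MPI calls are stripped from a C source file and collected, with their original line numbers, into a separate target that a model could be asked to produce after seeing the whole stripped program. The names `MPI_CALL`, `split_mpi_calls`, `build_example`, and `SAMPLE`, the one-call-per-line regular expression, and the `line: call` target format are all assumptions made for illustration, not the authors' implementation.

```python
import re

# Assumption: each MPI call sits alone on its own line, e.g. "MPI_Init(&argc, &argv);".
MPI_CALL = re.compile(r"^\s*(MPI_\w+)\s*\((.*)\)\s*;\s*$")


def split_mpi_calls(source):
    """Split C source into (code without MPI call lines, list of MPI calls).

    Each collected call is a tuple (original_line_number, function, arguments).
    """
    kept, calls = [], []
    for line_no, line in enumerate(source.splitlines(), start=1):
        match = MPI_CALL.match(line)
        if match:
            calls.append((line_no, match.group(1), match.group(2).strip()))
        else:
            kept.append(line)
    return "\n".join(kept), calls


def build_example(source):
    """Return (model input, expected completion) for one program.

    The input is the whole program with MPI calls removed; the completion
    lists every removed call together with the line where it belonged.
    """
    stripped, calls = split_mpi_calls(source)
    target = "\n".join(f"{n}: {func}({args});" for n, func, args in calls)
    return stripped, target


# Hypothetical sample program used only to exercise the sketch.
SAMPLE = """int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Finalize();
    return 0;
}"""

if __name__ == "__main__":
    model_input, completion = build_example(SAMPLE)
    print(model_input)
    print("---")
    print(completion)
```

Under this reading, the three quantities scored in the abstract map onto the three fields of each collected tuple: the line number (location), the function name, and the argument string.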
