The Message Passing Interface (MPI) plays a crucial role in distributed memory parallelization across multiple nodes. However, parallelizing code with MPI manually, and specifically performing domain decomposition, is a challenging, error-prone task. In this paper, we address this problem by developing MPI-RICAL, a novel data-driven programming-assistance tool that helps programmers write domain-decomposition-based distributed memory parallelization code. Specifically, we train a supervised language model to suggest MPI functions and their proper locations in the code on the fly. We also introduce MPICodeCorpus, the first publicly available corpus of MPI-based parallel programs, created by mining more than 15,000 open-source repositories on GitHub. We evaluate MPI-RICAL on MPICodeCorpus and, more importantly, on a compiled benchmark of MPI-based parallel programs for numerical computations that represent real-world scientific applications. MPI-RICAL achieves F1 scores between 0.87 and 0.91 on these programs, demonstrating its accuracy in suggesting correct MPI functions at appropriate code locations. The source code used in this work, as well as other relevant sources, is available at: https://github.com/Scientific-Computing-Lab-NRCN/MPI-rical
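To make the notion of domain decomposition concrete, the following is a minimal sketch of a one-dimensional block decomposition, written in plain Python so that it runs without an MPI installation. It is an illustration only and is not taken from MPI-RICAL or MPICodeCorpus: the helper names `block_range` and `local_sum` are hypothetical, and the loop over ranks is a serial stand-in for the processes of a real MPI program. The comments indicate where standard MPI calls such as `MPI_Comm_size`, `MPI_Comm_rank`, and `MPI_Reduce` would typically appear.

```python
def block_range(n_items, n_ranks, rank):
    """Return the half-open index range [start, stop) owned by `rank`."""
    base, extra = divmod(n_items, n_ranks)
    # The first `extra` ranks each take one additional item, so block
    # sizes differ by at most one.
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return start, stop


def local_sum(data, n_ranks, rank):
    """Sum only the slice of `data` that belongs to `rank`."""
    start, stop = block_range(len(data), n_ranks, rank)
    return sum(data[start:stop])


data = list(range(10))
n_ranks = 4  # assumption: in an MPI program this would come from MPI_Comm_size

# Serial stand-in for the ranks. In an MPI program each process would obtain
# its own rank via MPI_Comm_rank, compute one partial sum, and the partial
# sums would be combined with a reduction such as MPI_Reduce.
partials = [local_sum(data, n_ranks, rank) for rank in range(n_ranks)]
total = sum(partials)
```

Even in this small case, the programmer has to decide how the index space is split and where the communication calls belong, which is the kind of decision a suggestion tool is meant to support.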