In this study, we present a novel dataset for training machine learning models that translate between OpenMP Fortran and C++ code. To ensure reliability and applicability, the dataset is first refined using a code similarity test. We assess the effectiveness of the dataset using both quantitative (CodeBLEU) and qualitative (human evaluation) methods. We demonstrate that this dataset can significantly improve the translation capabilities of large language models, with improvements of $\mathbf{\times 5.1}$ for models with no prior coding knowledge and $\mathbf{\times 9.9}$ for models with some coding familiarity. Our work highlights the potential of this dataset to advance code translation for high-performance computing. The dataset is available at https://github.com/bin123apple/Fortran-CPP-HPC-code-translation-dataset