In this study, we present a novel dataset for training machine learning models to translate between OpenMP Fortran and C++ code. To ensure reliability and applicability, the dataset is first refined using a meticulous code similarity test. We assess the effectiveness of the dataset using both quantitative (CodeBLEU) and qualitative (human evaluation) methods. We demonstrate that this dataset can significantly improve the translation capabilities of large language models, with improvements of $5.1\times$ for models with no prior coding knowledge and $9.9\times$ for models with some coding familiarity. Our work highlights the potential of this dataset to advance the field of code translation for high-performance computing.