This paper presents a novel approach to sourcing large-scale training data for LLMs focused on science and engineering. In particular, a crucial challenge is sourcing parallel scientific codes at the scale of millions to billions of code samples. To tackle this problem, we propose an automated pipeline framework, called LASSI, designed to translate between parallel programming languages by bootstrapping existing closed- or open-source LLMs. LASSI incorporates autonomous enhancement through self-correcting loops, in which errors encountered during compilation and execution of generated code are fed back to the LLM through guided prompting for debugging and refactoring. We highlight the bi-directional translation of existing GPU benchmarks between OpenMP target offload and CUDA to validate LASSI. The results of evaluating LASSI on different application codes across four LLMs demonstrate its effectiveness at generating executable parallel codes, with 80% of OpenMP to CUDA translations and 85% of CUDA to OpenMP translations producing the expected output. We also observe that approximately 78% of OpenMP to CUDA translations and 62% of CUDA to OpenMP translations execute within 10% of the runtime of, or faster than, the original benchmark code in the same language.
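The self-correcting loop described above can be sketched as a simple retry cycle: generate a translation, attempt to validate it (e.g., by compiling and running), and on failure feed the errors back to the model in a guided follow-up prompt. This is a minimal sketch, not LASSI's actual implementation; the `llm_generate` and `check` callables, prompt wording, and retry limit are all hypothetical placeholders.

```python
def self_correcting_translate(source_code, target_lang, llm_generate,
                              check, max_attempts=3):
    """Translate source_code to target_lang with a self-correcting loop.

    llm_generate(prompt) -> candidate code string (hypothetical LLM wrapper).
    check(code) -> (ok, errors), e.g. a compile-and-run harness.
    Returns the first candidate that passes, or None if attempts run out.
    """
    prompt = f"Translate the following code to {target_lang}:\n{source_code}"
    for _ in range(max_attempts):
        candidate = llm_generate(prompt)
        ok, errors = check(candidate)
        if ok:
            return candidate
        # Guided prompting: feed compilation/execution errors back
        # to the LLM for debugging and refactoring.
        prompt = (f"The following {target_lang} code failed:\n{candidate}\n"
                  f"Errors:\n{errors}\n"
                  "Debug and refactor the code, then return a corrected version.")
    return None
```

In the full pipeline, `check` would compile the candidate (e.g., `nvcc` for CUDA or an OpenMP-offload-capable compiler), execute it, and compare its output against the original benchmark.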