Large language models (LLMs) have shown remarkable capabilities in code translation, yet their performance deteriorates in low-resource programming domains such as Fortran and emerging frameworks like CUDA, where high-quality parallel data are scarce. We present an automated dataset generation pipeline featuring a dual-LLM Questioner-Solver design that incorporates external knowledge from compilers and runtime feedback. Beyond traditional source-target code pair datasets, our approach additionally generates (1) verified translations with unit tests for assessing functional consistency and (2) multi-turn dialogues that capture the reasoning process behind translation refinement. Applied to Fortran-to-C++ and C++-to-CUDA, the pipeline produces 3.64k and 3.93k dialogues, respectively. Fine-tuning on these data substantially improves functional correctness, raising unit test success rates by over 56% on the challenging C++-to-CUDA task. We show that the generated data enable a 7B open-weight model to significantly outperform larger proprietary systems on key metrics such as compilation success.
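The feedback-driven refinement described in the abstract can be pictured as a loop in which each failed compile or test run becomes the next dialogue turn. The following is a minimal sketch under stated assumptions, not the authors' implementation: the names `Dialogue`, `refine_translation`, `solver`, and `check` are hypothetical, the prompts are placeholders, and the division of labor between Questioner and Solver is assumed, since the abstract does not specify it. The `solver` callable stands in for an LLM call, and `check` stands in for compiling the candidate and running its unit tests.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# Hypothetical types for illustration; none of these names come from the paper.
Feedback = Tuple[bool, str]  # (succeeded, diagnostic message)


@dataclass
class Dialogue:
    """A multi-turn record of one translation attempt and its refinements."""
    turns: List[Dict[str, str]] = field(default_factory=list)
    verified: bool = False


def refine_translation(
    source: str,
    solver: Callable[[str], str],
    check: Callable[[str], Feedback],
    max_rounds: int = 3,
) -> Dialogue:
    """Ask `solver` for a translation, then feed failures back as new prompts.

    `check` is assumed to wrap a compiler invocation plus unit tests and to
    return (True, "") on success or (False, diagnostics) on failure.
    """
    dialogue = Dialogue()
    prompt = f"Translate this code:\n{source}"
    for _ in range(max_rounds):
        candidate = solver(prompt)
        dialogue.turns.append({"role": "questioner", "content": prompt})
        dialogue.turns.append({"role": "solver", "content": candidate})
        ok, message = check(candidate)
        if ok:
            dialogue.verified = True
            break
        # Assumed behavior: the diagnostics are relayed as the next question.
        prompt = f"The previous attempt failed:\n{message}\nPlease fix it."
    return dialogue
```

In this sketch a dialogue is marked `verified` only when `check` succeeds, which mirrors the distinction the abstract draws between plain code pairs and verified translations; unverified dialogues could be kept or discarded depending on how the dataset is filtered.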