Code translation aims to convert code from a source language into a target language and is used in a wide range of software development scenarios. Recent advances in Large Language Models (LLMs) have demonstrated strong capabilities in code translation, and parallel corpora play a crucial role in training models for this task. Parallel corpora can be categorized into program-alignment (PA) and snippet-alignment (SA) data. Although PA data provides complete context and is well suited to learning semantic alignment, its length means it may not supply sufficiently fine-grained training signals, whereas the brevity of SA data enables more fine-grained alignment learning. Because parallel corpora are scarce, researchers have explored several augmentation methods for code translation, mostly focused on augmenting PA data. In this paper, we propose a data augmentation method that leverages LLMs to generate SA data automatically. To exploit both PA and SA data fully, we explore a simple yet effective two-stage training strategy, which consistently improves model performance compared to fine-tuning on PA data alone. Experiments on TransCoder-test demonstrate that our augmented SA data, combined with the two-stage training approach, yields consistent improvements over the baseline, with a maximum gain of 3.78% on pass@k.
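As a concrete illustration of the augmentation step, the sketch below prompts an LLM to split an existing PA pair into aligned snippet pairs and parses the result. The prompt wording, the tag format, and the `query_llm` callable are hypothetical stand-ins introduced for this example, not the paper's actual prompt or parser.

```python
# Minimal sketch of LLM-based SA-data augmentation: turn each
# program-aligned (PA) pair into several snippet-aligned (SA) pairs.
# The prompt, the <pair>/<src>/<tgt> tag scheme, and `query_llm`
# (any text-in/text-out callable) are assumptions for illustration.
import re

SPLIT_PROMPT = """You are given a C++ program and its Java translation.
Split both programs into the same number of semantically aligned snippets.
Wrap each pair as <pair><src>...</src><tgt>...</tgt></pair>.

C++:
{src}

Java:
{tgt}
"""

def extract_snippet_pairs(llm_output: str) -> list[tuple[str, str]]:
    """Parse aligned (source, target) snippet pairs from the LLM response."""
    pairs = re.findall(
        r"<pair><src>(.*?)</src><tgt>(.*?)</tgt></pair>",
        llm_output,
        flags=re.DOTALL,
    )
    return [(s.strip(), t.strip()) for s, t in pairs]

def augment_sa_data(pa_pairs, query_llm):
    """Expand PA pairs into SA pairs via one LLM call per program pair."""
    sa_pairs = []
    for src_prog, tgt_prog in pa_pairs:
        response = query_llm(SPLIT_PROMPT.format(src=src_prog, tgt=tgt_prog))
        sa_pairs.extend(extract_snippet_pairs(response))
    return sa_pairs
```

In practice one would also filter the parsed pairs (e.g., drop responses where the snippet counts on the two sides disagree) before adding them to the training set.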
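The two-stage strategy itself can be sketched as two consecutive fine-tuning passes over the same weights. The stage order shown below (SA first for fine-grained alignment, then PA for full-context learning), the base model, and all hyperparameters are assumptions for illustration, not the paper's reported configuration.

```python
# Minimal sketch of a two-stage fine-tuning loop with Hugging Face
# Transformers. The SA-first-then-PA order, the hyperparameters, and the
# expectation that both datasets are pre-tokenized (input_ids + labels)
# are assumptions made for this illustration.
from transformers import AutoModelForCausalLM, Trainer, TrainingArguments

def two_stage_finetune(model_name, sa_dataset, pa_dataset, out_dir="ckpts"):
    """Stage 1: learn fine-grained alignment from short SA pairs.
    Stage 2: continue training the same weights on full PA programs,
    exposing the model to complete program context."""
    model = AutoModelForCausalLM.from_pretrained(model_name)
    for stage, dataset in (("stage1_sa", sa_dataset), ("stage2_pa", pa_dataset)):
        args = TrainingArguments(
            output_dir=f"{out_dir}/{stage}",
            num_train_epochs=1,              # placeholder hyperparameters
            per_device_train_batch_size=8,
            learning_rate=2e-5,
        )
        Trainer(model=model, args=args, train_dataset=dataset).train()
    return model
```

Because the second stage continues from the first stage's weights rather than restarting from the base checkpoint, the model retains the fine-grained alignments while adapting to full-program context.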