Large Language Models (LLMs) have shown strong performance in automated source-to-target code translation, owing to pretraining on extensive code corpora. However, mainstream LLM-based code translation methods suffer from two critical limitations. First, they are highly sensitive to language-specific features, often carrying source-language syntax or lexical elements into the output and causing syntactic confusion. Second, they lack fine-grained semantic alignment because they over-rely on function-level parallel datasets, resulting in semantic misalignment between the translated code and the original source. To overcome these limitations, we propose TIT, a Tree-structured Instruction Tuning paradigm for LLM-based code translation. TIT consists of three modules. First, to mitigate syntactic confusion, the syntactic information representation module integrates language-agnostic syntactic features via structured parsing. Second, to generate high-quality fine-grained parallel data, the fine-grained parallel dataset augmentation module aligns tree nodes with code segments through statement-level segmentation and contrastive matching. Finally, the dual-stage tree instruction tuning module alleviates the contextual processing burden that the added syntactic information places on the LLM: the first stage applies syntax-aware fine-tuning so the LLM learns to comprehend structured syntactic information autonomously, while the second stage applies code-generation fine-tuning to guide the model toward accurate target code grounded in function-level syntactic dependencies. Experimental results demonstrate that the proposed method significantly outperforms existing approaches across multiple LLMs, achieving 1.22x-1.75x higher code translation success rates while markedly reducing syntactic confusion.
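To make the "structured parsing" idea behind the syntactic information representation module concrete, the sketch below uses Python's built-in `ast` parser to reduce a function to a language-agnostic tree of node-type labels. This is a minimal illustration, not the paper's implementation: the `tree_repr` helper and its serialization format are hypothetical, and TIT's actual parser and feature encoding are not specified here. It only shows one plausible way to keep syntax while stripping source-language lexicon.

```python
import ast

def tree_repr(node: ast.AST) -> str:
    """Serialize an AST node as a nested tree of node-type labels.

    Hypothetical helper for illustration: only node types are kept,
    so identifiers and literals from the source language are dropped.
    """
    children = [tree_repr(child) for child in ast.iter_child_nodes(node)]
    label = type(node).__name__  # node type only, no source-language lexicon
    return f"({label} {' '.join(children)})" if children else f"({label})"

source = "def add(a, b):\n    return a + b\n"

# Prints the node-type skeleton of the function, e.g.:
# (Module (FunctionDef (arguments (arg) (arg))
#   (Return (BinOp (Name (Load)) (Add) (Name (Load))))))
print(tree_repr(ast.parse(source)))
```

A representation of this kind also hints at why the dual-stage tuning is needed: the syntax tree adds useful structure for the model to condition on, but it also lengthens the context the LLM must process alongside the code itself.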