The advent of large language models (LLMs) has ushered in a new era of automated code translation across programming languages. Since most code-specific LLMs are pretrained on well-commented code from large repositories such as GitHub, it is reasonable to hypothesize that natural language code comments could improve translation quality. Despite their potential relevance, comments are largely absent from existing code translation benchmarks, leaving their impact on translation quality inadequately characterised. In this paper, we present a large-scale empirical study evaluating the impact of comments on translation performance. Our analysis involves more than $80,000$ translations, with and without comments, of $1100+$ code samples from two distinct benchmarks, covering pairwise translations between five programming languages: C, C++, Go, Java, and Python. Our results provide strong evidence that code comments, particularly those describing the overall purpose of the code rather than its line-by-line functionality, significantly enhance translation accuracy. Based on these findings, we propose COMMENTRA, a code translation approach, and demonstrate that it can potentially double the performance of LLM-based code translation. To the best of our knowledge, this study is the first of its comprehensiveness, scale, and language coverage to examine how code comments can improve code translation accuracy.