Code translation aims to convert source code from one programming language (PL) to another. Given the promising abilities of large language models (LLMs) in code synthesis, researchers are exploring their potential to automate code translation. Advancing the state of LLM-based code translation first requires understanding the strengths and limitations of LLMs relative to existing techniques. To that end, we present a large-scale empirical study of the ability of general LLMs and code LLMs to translate code between pairs of languages drawn from C, C++, Go, Java, and Python. Our study, which involves the translation of 1,700 code samples from three benchmarks and two real-world projects, reveals that LLMs cannot yet be used reliably to automate code translation: the share of correct translations ranges from 2.1% to 47.3% across the studied LLMs. Further manual investigation of unsuccessful translations identifies 15 categories of translation bugs. We also compare LLM-based code translation with traditional non-LLM-based approaches, and our analysis shows that each class of techniques has its own strengths and weaknesses. Finally, insights from our study suggest that providing more context to LLMs during translation can help them produce better results. Accordingly, we propose a prompt-crafting approach based on the symptoms of erroneous translations; this improves the performance of LLM-based code translation by 5.5% on average. Our study is the first of its kind in scale and breadth to provide insights into the current limitations of LLMs in code translation and opportunities for improving them. Our dataset, consisting of 1,700 code samples in five PLs with 10K+ tests, 43K+ translations, 1,725 manually labeled bugs, and 1,365 bug-fix pairs, can help drive research in this area.
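To make the idea of symptom-based prompt crafting concrete, the following is a minimal sketch of how a repair prompt might be assembled from a failed translation. It is not the prompt template used in the study: the `FailedTranslation` type, the `build_repair_prompt` function, the three symptom labels, and the prompt wording are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FailedTranslation:
    """One unsuccessful translation plus the observed failure (hypothetical schema)."""
    source_lang: str
    target_lang: str
    source_code: str
    translated_code: str
    symptom: str  # assumed labels: "compile_error", "runtime_error", "test_failure"
    error_message: str = ""
    test_input: Optional[str] = None
    expected_output: Optional[str] = None
    actual_output: Optional[str] = None


def build_repair_prompt(ft: FailedTranslation) -> str:
    """Build a follow-up prompt that adds symptom-specific context."""
    parts = [
        f"You translated the following {ft.source_lang} code to {ft.target_lang}.",
        f"{ft.source_lang} code:\n{ft.source_code}",
        f"Your {ft.target_lang} translation:\n{ft.translated_code}",
    ]
    if ft.symptom == "compile_error":
        parts.append(f"The translation does not compile. Compiler output:\n{ft.error_message}")
    elif ft.symptom == "runtime_error":
        parts.append(f"The translation fails at run time. Error:\n{ft.error_message}")
    elif ft.symptom == "test_failure":
        parts.append(
            "The translation produces a wrong result.\n"
            f"Input: {ft.test_input}\n"
            f"Expected output: {ft.expected_output}\n"
            f"Actual output: {ft.actual_output}"
        )
    else:
        raise ValueError(f"unknown symptom: {ft.symptom}")
    parts.append(f"Re-generate the {ft.target_lang} code so that this problem is fixed.")
    return "\n\n".join(parts)
```

The resulting string would be sent back to the model as a second attempt. Which symptoms to distinguish and how much context to include are design choices that this sketch does not settle.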