Code translation aims to convert source code from one programming language (PL) to another. Given the promising abilities of large language models (LLMs) in code synthesis, researchers are actively exploring their potential to automate code translation, i.e., generating code in a target PL from its equivalent in a source PL. A prerequisite for advancing LLM-based code translation is understanding its limitations. To that end, we present a large-scale empirical study of the ability of LLMs, both general LLMs and code LLMs, to translate code between pairs of five languages: C, C++, Go, Java, and Python. Our analysis involves the translation of 1,700 code samples from three distinct benchmarks and real-world projects, and reveals that LLMs cannot yet be reliably used to automate code translation: the share of incorrect translations ranges from 52.7% to 97.9% across the studied LLMs. Further manual investigation of unsuccessful translations across all PLs identifies 14 root causes of translation bugs. Based on the insights from the empirical study, we propose a prompt-crafting approach that provides additional context to LLMs, improving the performance of LLM-based code translation by 5.5% on average across different PLs, LLMs, and benchmarks. In scale and breadth, our study is the first of its kind to provide insights into the current limitations of LLMs in code translation and the opportunities for improving them. Our extensive dataset, consisting of 1,700 code samples written in five PLs with 10K+ tests, 43K+ translated code samples, 1,725 manually labeled bugs, and 1,365 bug-fix pairs generated using LLMs, can help drive research in this area.