Many software projects implement APIs and algorithms in multiple programming languages. Maintaining such projects is tedious, because developers must ensure that every change (e.g., a bug fix or a new feature) is propagated, promptly and without errors, to the implementations in the other programming languages. In the world of ever-changing software, rule-based translation tools (i.e., transpilers) and machine learning models that translate code from one language to another provide limited value, because retranslating the entire codebase after every change is not how developers work. In this paper, we target a novel task: translating code changes from one programming language to another using large language models (LLMs). We design and implement the first LLM, dubbed Codeditor, to tackle this task. Codeditor explicitly models code changes as edit sequences and learns to correlate changes across programming languages. To evaluate Codeditor, we collect a corpus of 6,613 aligned code changes from 8 pairs of open-source software projects that implement similar functionality in two programming languages (Java and C#). Results show that Codeditor outperforms state-of-the-art approaches by a large margin on all commonly used automatic metrics. Our work also reveals that Codeditor is complementary to existing generation-based models, and that combining them yields even better performance.
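
To make the idea of modeling a code change as an edit sequence concrete, here is a minimal Python sketch. It is an illustration only: the abstract does not describe Codeditor's actual edit format, so the tuple layout `(operation, start, end, replacement_tokens)`, the whitespace tokenization, and the helper names `edit_sequence` and `apply_edits` are all assumptions, built on the standard-library `difflib` module rather than on the authors' method.

```python
import difflib


def edit_sequence(old_tokens, new_tokens):
    """Describe how old_tokens becomes new_tokens as a list of edits.

    Each edit is (operation, start, end, replacement_tokens), where
    start/end index into old_tokens. This layout is a hypothetical
    choice for illustration, not the format used by Codeditor.
    """
    matcher = difflib.SequenceMatcher(a=old_tokens, b=new_tokens, autojunk=False)
    return [
        (tag, i1, i2, new_tokens[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"  # unchanged spans carry no edit information
    ]


def apply_edits(old_tokens, edits):
    """Replay an edit sequence on old_tokens and return the new token list."""
    result = list(old_tokens)
    # Apply edits right to left so earlier indices remain valid.
    for _tag, start, end, replacement in reversed(edits):
        result[start:end] = replacement
    return result


# Hypothetical one-token change to a Java statement.
java_old = "int x = a + b ;".split()
java_new = "long x = a + b ;".split()

java_edits = edit_sequence(java_old, java_new)
java_roundtrip = apply_edits(java_old, java_edits)
```

The point of the representation is that the edit list is much shorter than the full new code. A model that learns to map such an edit on the Java side to the corresponding edit on the C# side only has to predict the change itself, without regenerating the unchanged code.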