Many software projects implement APIs and algorithms in multiple programming languages. Maintaining such projects is tiresome, as developers have to ensure that any change (e.g., a bug fix or a new feature) is propagated, promptly and without errors, to the implementations in the other programming languages. In the world of ever-changing software, rule-based translation tools (i.e., transpilers) and machine learning models that translate code from one language to another provide limited value, because retranslating the entire codebase after every change is not how developers work. In this paper, we target a novel task: translating code changes from one programming language to another using large language models (LLMs). We design and implement the first LLM, dubbed Codeditor, to tackle this task. Codeditor explicitly models code changes as edit sequences and learns to correlate changes across programming languages. To evaluate Codeditor, we collect a corpus of 6,613 aligned code changes from 8 pairs of open-source software projects that implement similar functionality in two programming languages (Java and C#). Results show that Codeditor outperforms state-of-the-art approaches by a large margin on all commonly used automatic metrics. Our work also reveals that Codeditor is complementary to existing generation-based models, and that combining them yields even better performance.
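To make the notion of an edit sequence concrete, the following minimal sketch derives token-level edit operations from the old and new versions of a code fragment. It is an illustration only: the function name `edit_sequence`, the whitespace tokenization, and the `(operation, old_tokens, new_tokens)` tuple layout are assumptions made here, not Codeditor's actual input or output format.

```python
from difflib import SequenceMatcher


def edit_sequence(old_tokens, new_tokens):
    """Return the non-trivial edits that turn old_tokens into new_tokens.

    Each edit is a tuple (operation, removed_tokens, added_tokens), where
    operation is "replace", "delete", or "insert". Unchanged spans are skipped.
    This layout is hypothetical and chosen only for illustration.
    """
    matcher = SequenceMatcher(a=old_tokens, b=new_tokens, autojunk=False)
    edits = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "equal":
            continue  # keep only the parts of the code that actually changed
        edits.append((op, old_tokens[i1:i2], new_tokens[j1:j2]))
    return edits


# A small change in one language, tokenized naively on whitespace.
old_code = "int count = items.size ( ) ;".split()
new_code = "long count = items.size ( ) ;".split()

for edit in edit_sequence(old_code, new_code):
    print(edit)
```

An edit-based representation of this kind records only what changed and leaves the rest of the code implicit, which is the intuition behind translating a change rather than retranslating the whole codebase.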