In this paper, we leverage low-level compiler intermediate representations (IRs) to improve code translation. Traditional transpilers rely on syntactic information and handcrafted rules, which limits their applicability and produces unnatural-looking code. Applying neural machine translation (NMT) approaches to code has broadened the set of programs for which a natural-looking translation can be obtained. However, these approaches treat code as sequences of text tokens and still do not distinguish well enough between similar pieces of code that have different semantics in different languages. The consequence is low-quality translation, which reduces the practicality of NMT and underscores the need for an approach that significantly increases its accuracy. Here we propose to augment code translation with IRs, specifically LLVM IR, and report results on the C++, Java, Rust, and Go languages. Our method improves upon the state of the art for unsupervised code translation, increasing the number of correct translations by 11% on average, and by up to 79% for the Java → Rust pair with greedy decoding. With beam search, it increases the number of correct translations by 5.5% on average. We extend previous test sets for code translation by adding hundreds of Go and Rust functions. Additionally, we train models with high performance on the problem of IR decompilation, that is, generating source code from IR, and study using IRs as an intermediary pivot for translation.
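As a small illustration of what "similar pieces of code with different semantics" can mean, consider integer addition. This example is not taken from the paper; it is a sketch chosen for this note, and the function names `java_style_add` and `checked_style_add` are hypothetical. In Java, `a + b` on `int` values silently wraps on overflow, whereas the textually identical `a + b` on `i32` in Rust panics on overflow in debug builds. A token-level translation that copies the expression unchanged would therefore alter behavior; a Rust version that preserves the Java semantics has to request wrapping explicitly.

```rust
/// Mirrors Java's `a + b` on `int`: overflow wraps around (two's complement).
fn java_style_add(a: i32, b: i32) -> i32 {
    a.wrapping_add(b)
}

/// Makes overflow explicit: returns `None` instead of wrapping or panicking.
fn checked_style_add(a: i32, b: i32) -> Option<i32> {
    a.checked_add(b)
}

fn main() {
    // Wrapping addition past the maximum yields the minimum i32 value.
    println!("{}", java_style_add(i32::MAX, 1));
    // Checked addition reports the overflow as `None`.
    println!("{:?}", checked_style_add(i32::MAX, 1));
}
```

The two functions differ only in which overflow behavior they select, which is exactly the kind of distinction that is invisible in surface tokens but explicit at a lower level of representation.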