Interest in software migration is growing as software and society evolve. Manually migrating projects between languages is error-prone and expensive. In recent years, researchers have begun to explore automatic program translation with supervised deep learning techniques that learn from large-scale parallel code corpora. However, parallel resources are scarce in the programming language domain, and collecting bilingual data manually is costly. To address this issue, several unsupervised program translation systems have been proposed. However, these systems still rely on huge amounts of monolingual source code for training, which is very expensive. Moreover, these models perform poorly when translating languages that were not seen during pre-training. In this paper, we propose SDA-Trans, a syntax- and domain-aware model for program translation that leverages syntax structure and domain knowledge to enhance cross-lingual transfer ability. SDA-Trans adopts unsupervised training on a smaller-scale corpus, including Python and Java monolingual programs. Experimental results on function translation tasks between Python, Java, and C++ show that SDA-Trans outperforms many large-scale pre-trained models, especially when translating unseen languages.