Generating pseudocode descriptions of legacy source code for software maintenance is a labor-intensive manual task. Recent encoder-decoder language models have shown promise for automating pseudocode generation for high-resource programming languages (PLs) such as C++, but they rely heavily on the availability of a large code-pseudocode corpus. Soliciting such pseudocode annotations for code written in legacy PLs is time-consuming and costly, and requires a thorough understanding of the source PL. In this paper, we focus on transferring the knowledge acquired by a code-to-pseudocode neural model trained on parallel code-pseudocode data for a high-resource PL (C++) to a legacy PL (C) for which no PL-pseudocode parallel data is available for training. To achieve this, we use an Iterative Back Translation (IBT) approach with a novel test-case-based filtration strategy to adapt the trained C++-to-pseudocode model into a C-to-pseudocode model. We observe an improvement of 23.27% in the success rate of the C code generated through back translation over successive IBT iterations, illustrating the efficacy of our approach.
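The filtration idea can be illustrated with a minimal sketch. It is not the paper's implementation: the function names, the `fine_tune` callback, and the choice to update both translation directions in each round are assumptions made for illustration, and the toy stand-ins at the bottom replace the trained models and the C compiler so that the example runs on its own. Each C program is translated to pseudocode, translated back to C, and the (program, pseudocode) pair is kept as synthetic training data only if the regenerated program passes every test case.

```python
from typing import Callable, List, Sequence, Tuple

TestCase = Tuple[str, str]  # (program input, expected output)
Pair = Tuple[str, str]      # (C source, generated pseudocode)


def filter_by_tests(
    c_programs: Sequence[str],
    test_suites: Sequence[Sequence[TestCase]],
    code_to_pseudo: Callable[[str], str],  # hypothetical: current C-to-pseudocode model
    pseudo_to_code: Callable[[str], str],  # hypothetical: pseudocode-to-C model
    run: Callable[[str, str], str],        # hypothetical: compile and run C on one input
) -> Tuple[List[Pair], float]:
    """Back-translate each program; keep pairs whose regenerated code passes all tests."""
    kept: List[Pair] = []
    for source, suite in zip(c_programs, test_suites):
        pseudo = code_to_pseudo(source)
        regenerated = pseudo_to_code(pseudo)
        if all(run(regenerated, given) == expected for given, expected in suite):
            kept.append((source, pseudo))
    rate = len(kept) / len(c_programs) if c_programs else 0.0
    return kept, rate


def iterative_back_translation(
    c_programs, test_suites, code_to_pseudo, pseudo_to_code, run,
    fine_tune,      # hypothetical: (model_a, model_b, pairs) -> (model_a, model_b)
    iterations=2,
):
    """Repeat filter-then-fine-tune; return the adapted model and per-round success rates."""
    rates = []
    for _ in range(iterations):
        pairs, rate = filter_by_tests(
            c_programs, test_suites, code_to_pseudo, pseudo_to_code, run
        )
        rates.append(rate)
        code_to_pseudo, pseudo_to_code = fine_tune(code_to_pseudo, pseudo_to_code, pairs)
    return code_to_pseudo, rates


# Toy stand-ins (assumptions): "programs" are names with known behaviour.
TOY_SEMANTICS = {"double": lambda x: 2 * x, "negate": lambda x: -x}


def toy_run(program: str, given: str) -> str:
    return str(TOY_SEMANTICS[program](int(given)))


def toy_code_to_pseudo(program: str) -> str:
    # Deliberately imperfect model: it describes every program as doubling.
    return "multiply the input by 2"


def toy_pseudo_to_code(pseudo: str) -> str:
    return "double" if "multiply" in pseudo else "negate"
```

With the toy stand-ins, the pair for `negate` is discarded because its regenerated program fails its test case, so only verified pairs would reach the fine-tuning step. In a real setting, `run` would compile and execute the generated C code, and the success rate returned per round is the quantity one would track across iterations.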