Neural Models for Source Code Synthesis and Completion

Natural language (NL) to code suggestion systems assist developers in Integrated Development Environments (IDEs) by translating NL utterances into compilable code snippets. Current approaches are mainly hard-coded, rule-based systems built on semantic parsing. They rely heavily on hand-crafted rules that map patterns in the NL utterance, or elements of its syntax parse tree, to various query constructs, and they work only on a limited subset of NL with a restricted syntax. Such systems cannot extract semantic information from the developer's coding intent, and they often fail to infer the types, names, and context of the source code needed for accurate system-level code suggestions. In this master's thesis, we present sequence-to-sequence deep learning models and training paradigms that map NL to general-purpose programming languages. The models suggest source code snippets for a given NL intent and also provide auto-completion while the user is writing source code. The developed architecture incorporates contextual awareness into neural models that generate source code tokens directly, instead of generating parse trees or abstract meaning representations of the source code and converting them back to source code. The proposed pretraining strategy and data augmentation techniques improve the performance of the architecture, which exceeds that of TranX, a neural semantic parser, by 10.82% on the BLEU-4 metric. We then present a finer-grained analysis of the parsable code translations produced from NL intents in the CoNaLA challenge. The system is bidirectional: it can also generate NL code documentation from source code. Lastly, we propose a RoBERTa masked language model for Python to extend the developed system to code completion.
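
The notion of a "parsable" code translation can be illustrated with a minimal sketch. It assumes that parsable means the generated snippet is accepted by Python's own parser; the thesis may define it differently. The function names and sample snippets below are hypothetical and are not taken from the thesis or from the CoNaLA data.

```python
import ast


def is_parsable(snippet: str) -> bool:
    """Return True if `snippet` is syntactically valid Python.

    Assumption: "parsable" is judged by the grammar of the Python
    version running this script, via the standard-library ast module.
    """
    try:
        ast.parse(snippet)
        return True
    except (SyntaxError, ValueError):
        # SyntaxError: invalid syntax; ValueError: e.g. null bytes on
        # some Python versions.
        return False


def parsable_fraction(snippets) -> float:
    """Fraction of snippets that parse; 0.0 for an empty collection."""
    snippets = list(snippets)
    if not snippets:
        return 0.0
    return sum(is_parsable(s) for s in snippets) / len(snippets)


# Hypothetical model outputs, for illustration only.
candidates = [
    "sorted(d.items(), key=lambda kv: kv[1])",  # valid expression
    "for x in range(10) print(x)",              # missing colon
    "open('f.txt').read(",                      # unbalanced parenthesis
]

if __name__ == "__main__":
    print(f"parsable: {parsable_fraction(candidates):.2f}")
```

A check like this only tests syntax. It says nothing about whether a snippet implements the NL intent, so it complements an overlap metric such as BLEU-4 rather than replacing it.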
