The application of deep learning techniques in software engineering is becoming increasingly popular. One key problem is developing high-quality, easy-to-use source code representations for code-related tasks. The research community has achieved impressive results in recent years. However, because of deployment difficulties and performance bottlenecks, these approaches are seldom applied in industry. In this paper, we present xASTNN, an eXtreme Abstract Syntax Tree (AST)-based Neural Network for source code representation, which aims to push this technique into industrial practice. The proposed xASTNN has three advantages. First, xASTNN is based entirely on widely used ASTs and does not require complicated data pre-processing, making it applicable to various programming languages and practical scenarios. Second, three closely related designs are proposed to ensure the effectiveness of xASTNN: a statement subtree sequence for code naturalness, a gated recursive unit for syntactic information, and a gated recurrent unit for sequential information. Third, a dynamic batching algorithm is introduced to significantly reduce the time complexity of xASTNN. Two code comprehension downstream tasks, code classification and code clone detection, are adopted for evaluation. The results demonstrate that xASTNN can improve on the state of the art while being faster than the baselines.
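To make the idea of a statement subtree sequence more concrete, here is a minimal sketch that uses Python's standard `ast` module. It is an illustration under stated assumptions, not xASTNN's implementation: the abstract does not give the actual splitting rules, so the sketch assumes that every statement node starts its own subtree and that nested statements are cut out of their parent. The function names `statement_subtrees` and `subtree_labels` are hypothetical.

```python
import ast


def statement_subtrees(source: str) -> list:
    """Collect statement nodes of `source` in preorder (assumed splitting rule)."""
    tree = ast.parse(source)
    sequence = []

    def visit(node):
        # Assumption: each statement node is the root of one subtree.
        if isinstance(node, ast.stmt):
            sequence.append(node)
        for child in ast.iter_child_nodes(node):
            visit(child)

    visit(tree)
    return sequence


def subtree_labels(stmt: ast.stmt) -> list:
    """Node-type labels of one statement subtree, excluding nested statements."""
    labels = [type(stmt).__name__]

    def visit(node):
        for child in ast.iter_child_nodes(node):
            if isinstance(child, ast.stmt):
                continue  # nested statements belong to their own subtrees
            labels.append(type(child).__name__)
            visit(child)

    visit(stmt)
    return labels


if __name__ == "__main__":
    code = "def f(x):\n    if x > 0:\n        return x\n    return -x\n"
    for stmt in statement_subtrees(code):
        print(subtree_labels(stmt))
```

In a model of the kind the abstract describes, each such subtree would presumably be encoded by a tree-structured (recursive) unit, and the resulting sequence of subtree vectors would then be passed to a recurrent unit. That pipeline is not shown here.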