The application of deep learning techniques in software engineering has become increasingly popular. One key problem is developing high-quality and easy-to-use source code representations for code-related tasks. The research community has achieved impressive results in recent years. However, due to deployment difficulties and performance bottlenecks, these approaches are seldom applied in industry. In this paper, we present xASTNN, an eXtreme Abstract Syntax Tree (AST)-based Neural Network for source code representation, aiming to push this technique toward industrial practice. The proposed xASTNN has three advantages. First, xASTNN is based entirely on widely used ASTs and does not require complicated data pre-processing, making it applicable to various programming languages and practical scenarios. Second, three closely related designs are proposed to guarantee the effectiveness of xASTNN: a statement subtree sequence for code naturalness, a gated recursive unit for syntactical information, and a gated recurrent unit for sequential information. Third, a dynamic batching algorithm is introduced to significantly reduce the time complexity of xASTNN. Two code comprehension downstream tasks, code classification and code clone detection, are adopted for evaluation. The results demonstrate that xASTNN improves on the state of the art while being faster than the baselines.
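To make the idea of a statement subtree sequence more concrete, the following is a minimal sketch that walks an AST in pre-order and collects one entry per statement node. It is an illustration only, not the xASTNN implementation: the abstract does not specify the splitting rules, the choice of Python's standard `ast` module as the parser is an assumption, and the names `statement_types` and `EXAMPLE` are hypothetical.

```python
import ast

# Hypothetical input program used only for illustration.
EXAMPLE = """
def total(xs):
    s = 0
    for x in xs:
        s += x
    return s
"""


def statement_types(source: str) -> list[str]:
    """Return the type names of all statement nodes in pre-order.

    Assumption: each statement node roots one "statement subtree";
    the actual decomposition used by xASTNN may differ.
    """
    tree = ast.parse(source)
    sequence: list[str] = []

    def visit(node: ast.AST) -> None:
        # Record the node if it is a statement, then descend into children
        # in field order, which follows source order for statement bodies.
        if isinstance(node, ast.stmt):
            sequence.append(type(node).__name__)
        for child in ast.iter_child_nodes(node):
            visit(child)

    visit(tree)
    return sequence


if __name__ == "__main__":
    print(statement_types(EXAMPLE))
```

In a pipeline of the kind the abstract describes, each statement subtree would presumably be encoded by the recursive unit and the resulting sequence of vectors passed to the recurrent unit; this sketch covers only the sequence extraction step.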