Efficiently representing source code is crucial for software engineering tasks such as code classification and clone detection. Existing approaches primarily use the Abstract Syntax Tree (AST); only a few focus on semantic graphs such as the Control Flow Graph (CFG) and the Program Dependency Graph (PDG), which capture information about source code that the AST does not. Although some works have tried to combine multiple representations, they offer no insight into the costs and benefits of doing so. The primary goal of this paper is to discuss the implications of using multiple code representations, specifically AST, CFG, and PDG. To measure the impact of additional representations (such as CFG and PDG) over AST alone, we modify an AST path-based approach so that it accepts multiple representations as input to an attention-based model. We evaluate our approach on three tasks: Method Naming, Program Classification, and Clone Detection. Our approach improves performance on these tasks by 11% (F1), 15.7% (Accuracy), and 9.3% (F1), respectively, over the baseline. Beyond the effect on performance, we discuss the timing overheads incurred by using multiple representations. We envision this work providing researchers with a lens to evaluate combinations of code representations for various tasks.
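To make the idea of feeding paths from several representations into an attention-based model more concrete, the following is a minimal sketch, not the implementation evaluated in this work. It assumes that each path (from AST, CFG, or PDG) has already been embedded as a fixed-size vector, and that paths from all representations are pooled into a single bag before attention is applied; the actual model may combine representations differently. The names `attend` and `combine_representations`, the toy vectors, and the attention vector are all hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(path_vectors, attention_vector):
    """Aggregate path vectors into one code vector via dot-product attention.

    Returns the weighted-sum code vector and the attention weights.
    """
    if not path_vectors:
        raise ValueError("at least one path vector is required")
    # Score each path by its dot product with the attention vector.
    scores = [
        sum(p * a for p, a in zip(vec, attention_vector))
        for vec in path_vectors
    ]
    weights = softmax(scores)
    dim = len(attention_vector)
    # Weighted sum of the path vectors, one dimension at a time.
    code_vector = [
        sum(w * vec[i] for w, vec in zip(weights, path_vectors))
        for i in range(dim)
    ]
    return code_vector, weights


def combine_representations(paths_by_repr, attention_vector):
    """Pool paths from AST, CFG, and PDG into one bag, then attend over it.

    Assumption: pooling into a single bag is one possible design choice,
    used here only for illustration.
    """
    all_paths = [
        vec
        for name in ("AST", "CFG", "PDG")
        for vec in paths_by_repr.get(name, [])
    ]
    return attend(all_paths, attention_vector)


# Toy, hand-written embeddings (hypothetical values, 2 dimensions).
toy_paths = {
    "AST": [[1.0, 0.0], [0.0, 1.0]],
    "CFG": [[1.0, 1.0]],
    "PDG": [[0.5, 0.5]],
}
toy_attention = [1.0, 0.0]

code_vector, weights = combine_representations(toy_paths, toy_attention)
```

Passing only the `"AST"` entry to `combine_representations` gives an AST-only code vector, which is one simple way to compare a single representation against the combined input in this sketch.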