On the Impact of Multiple Source Code Representations on Software Engineering Tasks -- An Empirical Study

Efficiently representing source code is essential for software engineering tasks such as code classification and code clone detection. Existing approaches primarily represent source code with the abstract syntax tree (AST); only a few works consider semantic graphs such as the control flow graph (CFG) and the program dependence graph (PDG), which carry essential information about source code that the AST lacks. Although some works have tried to combine multiple representations, they offer no insight into the costs and benefits of doing so compared with using a single representation suited to the task. Moreover, they rely on hand-crafted program features designed for one specific task, which limits their use cases. The primary goal of this paper is to discuss the implications of using multiple code representations, specifically AST, CFG, and PDG, and how each of them affects task performance. To this end, we use an approach that draws program features from multiple code graphs without being coupled to a particular task or language. Our approach stems from the idea of modeling the AST as a set of paths and using a learning model to capture program properties. We modify an existing AST path-based approach to accept multiple code representations as input, which allows us to measure the performance gain that the additional representations provide over the AST alone. We evaluate our approach on three tasks: Method Naming, Program Classification, and Code Clone Detection. Our approach improves performance on these three tasks by 11% (F1), 15.7% (Accuracy), and 9.3% (F1), respectively, over the baseline. We discuss the impact of semantic features from the CFG and PDG paths on performance, as well as the additional overheads our approach incurs. We envision this work providing researchers with a lens to evaluate combinations of source code representations for various tasks.
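
To make the idea of modeling an AST as a set of paths more concrete, the following is a minimal sketch, not the authors' implementation. It assumes Python source parsed with the standard `ast` module, and the names `ast_path_contexts`, `_label`, `_root_to_leaf_paths`, and the `max_length` cutoff are hypothetical choices made only for illustration. Each pair of leaves yields a triple of start label, sequence of node types through their lowest common ancestor, and end label.

```python
import ast
from itertools import combinations


def _label(node):
    """Readable label for a leaf: identifier, argument name, literal, or node type."""
    if isinstance(node, ast.Name):
        return node.id
    if isinstance(node, ast.arg):
        return node.arg
    if isinstance(node, ast.Constant):
        return repr(node.value)
    return type(node).__name__


def _root_to_leaf_paths(tree):
    """Collect every root-to-leaf path as a list of AST nodes."""
    paths = []

    def walk(node, prefix):
        prefix = prefix + [node]
        # Ignore Load/Store/Del markers so identifiers act as the leaves.
        children = [c for c in ast.iter_child_nodes(node)
                    if not isinstance(c, ast.expr_context)]
        if not children:
            paths.append(prefix)
        for child in children:
            walk(child, prefix)

    walk(tree, [])
    return paths


def ast_path_contexts(source, max_length=8):
    """Return (start_label, node_type_path, end_label) for each leaf pair.

    max_length is an illustrative cutoff on the number of nodes in a path.
    """
    leaf_paths = _root_to_leaf_paths(ast.parse(source))
    contexts = []
    for a, b in combinations(leaf_paths, 2):
        # The shared prefix ends at the lowest common ancestor of both leaves.
        i = 0
        while a[i] is b[i]:
            i += 1
        up = [type(n).__name__ for n in reversed(a[i:])]
        top = [type(a[i - 1]).__name__]
        down = [type(n).__name__ for n in b[i:]]
        path = tuple(up + top + down)
        if len(path) <= max_length:
            contexts.append((_label(a[-1]), path, _label(b[-1])))
    return contexts
```

Calling `ast_path_contexts("x = y")` produces a single triple linking the leaf `x` to the leaf `y` through the `Assign` node. Under the same assumptions, paths taken from a CFG or PDG could be emitted in the same triple form and fed to the learning model alongside the AST paths; how the paper actually encodes and combines them is not specified in this abstract.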
