Gitor: Scalable Code Clone Detection by Building Global Sample Graph

Code clone detection aims to find similar code fragments and has drawn much attention in software engineering because of its importance for software maintenance and evolution. Researchers have proposed many techniques and tools for source code clone detection, but current methods analyze or process code samples individually without exploring the underlying connections among them. In this paper, we propose Gitor to capture the underlying connections among different code samples. Specifically, given a source code database, we first tokenize all code samples to extract pre-defined individual information. We then use each sample's individual information to build a large global sample graph in which each node is either a code sample or a type of individual information. Next, we apply a node embedding technique to the global sample graph to extract a vector representation for every sample. With these vectors in hand, we can simply compare the similarity between any two samples to detect possible clone pairs. More importantly, since the vector of a sample is derived from the global sample graph, we can combine it with the sample's own code features to improve clone detection performance. To demonstrate the effectiveness of Gitor, we evaluate it on a widely used dataset, BigCloneBench. Our experimental results show that Gitor achieves higher code clone detection accuracy than existing state-of-the-art tools and excellent execution time for inputs of various sizes. Moreover, we evaluate the combination of Gitor with other traditional vector-based clone detection methods; the results show that using Gitor enables them to detect more code clones with a higher F1 score.
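The pipeline described above (tokenize, build a global sample graph, embed the nodes, compare vectors) can be illustrated with a minimal Python sketch. It is not Gitor's implementation and rests on several assumptions: word tokens stand in for the pre-defined individual information, a two-round neighborhood propagation stands in for the node embedding technique (which the abstract does not name), and the function names and the 0.7 similarity threshold are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict
from itertools import combinations


def tokenize(code):
    """Extract word tokens (keywords, identifiers).

    Assumption: these tokens stand in for the pre-defined individual information.
    """
    return re.findall(r"[A-Za-z_]\w*", code)


def build_graph(samples):
    """Build a bipartite global graph linking sample nodes and token nodes.

    Edge weights are token counts; both edge directions are returned.
    """
    sample_to_tok = {name: Counter(tokenize(code)) for name, code in samples.items()}
    tok_to_sample = defaultdict(dict)
    for name, counts in sample_to_tok.items():
        for tok, count in counts.items():
            tok_to_sample[tok][name] = count
    return sample_to_tok, tok_to_sample


def normalize(vec):
    """Scale a sparse vector (dict) to unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {k: v / norm for k, v in vec.items()} if norm else dict(vec)


def embed(sample_to_tok, tok_to_sample):
    """Toy node embedding by propagation (a stand-in, not the paper's technique)."""
    # Round 1: describe each token node by the samples it occurs in.
    tok_vec = {tok: normalize(neigh) for tok, neigh in tok_to_sample.items()}
    # Round 2: each sample node aggregates the vectors of its token neighbors,
    # so its vector reflects every sample reachable through shared tokens.
    sample_vec = {}
    for name, counts in sample_to_tok.items():
        acc = defaultdict(float)
        for tok, count in counts.items():
            for other, weight in tok_vec[tok].items():
                acc[other] += count * weight
        sample_vec[name] = normalize(acc)
    return sample_vec


def cosine(u, v):
    """Cosine similarity of two unit-length sparse vectors (their dot product)."""
    return sum(w * v.get(k, 0.0) for k, w in u.items())


def detect_clones(samples, threshold=0.7):
    """Return sample pairs whose vectors are at least `threshold` similar."""
    vectors = embed(*build_graph(samples))
    return [(a, b) for a, b in combinations(sorted(vectors), 2)
            if cosine(vectors[a], vectors[b]) >= threshold]


samples = {
    "sum_a": "int sum(int[] xs) { int total = 0; for (int x : xs) total += x; return total; }",
    "sum_b": "int add(int[] arr) { int total = 0; for (int v : arr) total += v; return total; }",
    "greet": 'String greet(String name) { return "Hello " + name; }',
}
print(detect_clones(samples))
```

Because every sample vector is computed from the shared graph rather than from the sample alone, adding or removing samples changes the vectors of the others. This is the sense in which the representation is global.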
