Gitor: Scalable Code Clone Detection by Building Global Sample Graph

Code clone detection, the task of finding similar code fragments, has drawn much attention in software engineering because it is important for software maintenance and evolution. Researchers have proposed many techniques and tools for source code clone detection, but current methods concentrate on analyzing or processing code samples individually without exploring the underlying connections among them. In this paper, we propose Gitor to capture the underlying connections among different code samples. Specifically, given a source code database, we first tokenize all code samples to extract pre-defined individual information. We then use the individual information of all samples to build a large global sample graph in which each node is either a code sample or a type of individual information. Next, we apply a node embedding technique to the global sample graph to extract vector representations of all samples. With these vectors, we can simply compare the similarity between any two samples to detect possible clone pairs. More importantly, since a sample's vector is derived from the global sample graph, we can combine it with the sample's own code features to improve clone detection performance. To demonstrate the effectiveness of Gitor, we evaluate it on the widely used BigCloneBench dataset. Our experimental results show that Gitor achieves higher accuracy in code clone detection and excellent execution time for inputs of various sizes compared to existing state-of-the-art tools. Moreover, we evaluate the combination of Gitor with other traditional vector-based clone detection methods; the results show that using Gitor enables them to detect more code clones with a higher F1.
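The pipeline described above (tokenize, build a global sample graph, embed nodes, compare vectors) can be illustrated with a small sketch. This is not Gitor's implementation: the abstract does not specify what the pre-defined individual information is or which node embedding technique is used, so the sketch assumes identifier and keyword tokens as the individual information and substitutes a simple random-vector propagation for the embedding step. All function names, the vector dimension, the number of propagation steps, and the similarity threshold are hypothetical.

```python
import math
import random
import re
from collections import defaultdict
from itertools import combinations


def tokenize(code):
    """Crude tokenizer: keep identifiers and keywords, drop literals and punctuation."""
    return re.findall(r"[A-Za-z_]\w*", code)


def build_global_graph(samples):
    """Bipartite graph linking each sample node to the token nodes it contains."""
    graph = defaultdict(set)
    for sample_id, code in samples.items():
        for token in set(tokenize(code)):
            graph[("sample", sample_id)].add(("token", token))
            graph[("token", token)].add(("sample", sample_id))
    return graph


def embed_nodes(graph, dim=128, steps=3, seed=0):
    """Stand-in embedding (an assumption, not the paper's technique):
    random initial vectors smoothed by a few rounds of neighbor averaging."""
    rng = random.Random(seed)
    nodes = sorted(graph)  # fixed order keeps the random initialisation reproducible
    vecs = {n: [rng.gauss(0.0, 1.0) for _ in range(dim)] for n in nodes}
    for _ in range(steps):
        # Every node in the graph has at least one neighbor by construction.
        vecs = {
            n: [sum(vecs[m][i] for m in graph[n]) / len(graph[n]) for i in range(dim)]
            for n in nodes
        }
    return vecs


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def detect_clones(samples, threshold=0.8):
    """Report sample pairs whose graph-derived vectors are similar enough.
    The threshold is a hypothetical value chosen for illustration."""
    vecs = embed_nodes(build_global_graph(samples))
    pairs = []
    for a, b in combinations(sorted(samples), 2):
        va, vb = vecs.get(("sample", a)), vecs.get(("sample", b))
        if va is not None and vb is not None and cosine(va, vb) >= threshold:
            pairs.append((a, b))
    return pairs


if __name__ == "__main__":
    demo = {
        "a": "int sum(int[] xs) { int total = 0; for (int x : xs) { total += x; } return total; }",
        "b": "int sum(int[] xs) { int total = 0; for (int x : xs) { total += x; } "
             "System.out.println(total); return total; }",
        "c": 'String greet(String name) { return "Hello " + name; }',
    }
    print(detect_clones(demo))
```

Because every sample's vector is computed from the shared graph rather than from the sample alone, adding or removing samples changes the vectors of the others; this is the "global" property the abstract emphasizes. Combining a graph-derived vector with a sample's own features could be as simple as concatenating the two vectors before comparison, though the abstract does not say how Gitor does this.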
