The rapid evolution of programming languages and software systems has created a need for clone detection tools that are both multilingual and scalable. However, achieving both properties at once is difficult, and most existing tools address only one of them. In this work, we propose TGMM, a tree- and GPU-based tool for multilingual and multi-granularity code clone detection. By generating parse trees from user-provided grammar files, TGMM can extract code blocks at a specified granularity and detect Type-3 clones efficiently. To evaluate TGMM, we compare it with seven state-of-the-art tools in terms of recall, precision, and execution time. TGMM ranks first in execution time and precision, while its recall is comparable to that of the other tools. Moreover, we analyze the language extensibility of TGMM across 30 mainstream programming languages. Of these, 25 are supported, while the remaining five currently lack the necessary grammar files. Finally, we analyze the clone characteristics of nine popular languages at five common granularities to inform future research. The source code of TGMM is available at: https://github.com/TGMM24/TGMM.git.