The rapid evolution of programming languages and software systems calls for clone detection tools that are both multilingual and scalable. Meeting both requirements at once is difficult, and most existing tools address only one of them. In this work, we propose TGMM, a tree- and GPU-based tool for multilingual, multi-granularity code clone detection. By generating parse trees from user-provided grammar files, TGMM extracts code blocks at a specified granularity and detects Type-3 clones efficiently. To evaluate TGMM, we compare it with seven state-of-the-art tools in terms of recall, precision, and execution time. TGMM ranks first in execution time and precision, while its recall is comparable to that of the other tools. Moreover, we analyze the language extensibility of TGMM across 30 mainstream programming languages: 25 are supported, while the remaining five currently lack the necessary grammar files. Finally, we analyze the clone characteristics of nine popular languages at five common granularities to inform future research. The source code of TGMM is available at: https://github.com/TGMM24/TGMM.git.
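To make the idea of granularity-specific extraction from a parse tree concrete, the following is a minimal sketch and not TGMM's implementation. It assumes Python's built-in `ast` module in place of trees generated from user-provided grammar files, runs on the CPU only, and swaps in a multiset Jaccard similarity over node types as a stand-in for whatever similarity measure TGMM uses. The names `GRANULARITY`, `extract_blocks`, `node_type_bag`, `find_clones`, and the 0.7 threshold are all hypothetical.

```python
import ast
from collections import Counter
from itertools import combinations

# Hypothetical mapping from a granularity name to parse-tree node types.
GRANULARITY = {
    "function": (ast.FunctionDef, ast.AsyncFunctionDef),
    "class": (ast.ClassDef,),
    "loop": (ast.For, ast.While),
}

def extract_blocks(source, granularity):
    """Return (label, subtree) pairs for every node at the requested granularity."""
    wanted = GRANULARITY[granularity]
    return [
        (getattr(node, "name", f"line{node.lineno}"), node)
        for node in ast.walk(ast.parse(source))
        if isinstance(node, wanted)
    ]

def node_type_bag(node):
    """Multiset of node-type names in a subtree; identifiers and literals are ignored."""
    return Counter(type(n).__name__ for n in ast.walk(node))

def similarity(a, b):
    """Multiset Jaccard similarity (an assumed measure, chosen for brevity)."""
    union = sum((a | b).values())
    return sum((a & b).values()) / union if union else 1.0

def find_clones(source, granularity="function", threshold=0.7):
    """Report block pairs whose structural similarity reaches the threshold."""
    blocks = [(label, node_type_bag(node))
              for label, node in extract_blocks(source, granularity)]
    pairs = []
    for (label_a, bag_a), (label_b, bag_b) in combinations(blocks, 2):
        score = similarity(bag_a, bag_b)
        if score >= threshold:
            pairs.append((label_a, label_b, score))
    return pairs

# Two functions that differ by renamed identifiers and one added statement
# (the kind of difference a Type-3 clone allows), plus an unrelated function.
SAMPLE = '''
def total(xs):
    s = 0
    for x in xs:
        s += x
    return s

def summed(values):
    acc = 0
    for v in values:
        acc += v
    print(acc)
    return acc

def greet(name):
    return "hello " + name
'''
```

Calling `find_clones(SAMPLE)` reports `total` and `summed` as a candidate pair and leaves `greet` unmatched. Passing `granularity="loop"` compares the two `for` loops instead. The all-pairs comparison is quadratic in the number of blocks, which is the step a scalable tool would need to accelerate.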