Measuring and evaluating source code similarity is a fundamental software engineering activity with a broad range of applications, including but not limited to code recommendation and the detection of duplicate code, plagiarism, malware, and code smells. This paper presents a systematic literature review and meta-analysis of code similarity measurement and evaluation techniques to shed light on the existing approaches and their characteristics in different applications. We initially found over 10,000 articles by querying four digital libraries and ended up with 136 primary studies in the field. The studies were classified according to their methodology, programming languages, datasets, tools, and applications. A detailed investigation identifies 80 software tools that apply eight different techniques across five application domains. Nearly 49% of the tools work on Java programs and 37% support C and C++, while many programming languages have no tool support at all. Notably, we found 12 datasets related to source code similarity measurement and duplicate code, of which only eight are publicly accessible. The main challenges in the field are the lack of reliable datasets, empirical evaluations, hybrid methods, and attention to multi-paradigm languages. Emerging applications of code similarity measurement target the development phase in addition to maintenance.