Using a Nearest-Neighbour, BERT-Based Approach for Scalable Clone Detection

Code clones can detrimentally impact software maintenance, and manually detecting them in very large codebases is impractical. Additionally, automated approaches find the detection of Type 3 and Type 4 (inexact) clones very challenging. While the most recent deep neural networks (for example, BERT-based networks) seem to be highly effective in detecting such clones, their pairwise comparison of every code pair in the target system(s) is inefficient and scales poorly on large codebases. We therefore introduce SSCD, a BERT-based clone detection approach that targets high recall of Type 3 and Type 4 clones at scale (in line with our industrial partner's requirements). It does so by computing a representative embedding for each code fragment and finding similar fragments using a nearest neighbour search. SSCD thus avoids the pairwise-comparison bottleneck of other neural network approaches while also using parallel, GPU-accelerated search to tackle scalability. This paper details the approach and an empirical assessment towards configuring and evaluating that approach in an industrial setting. The configuration analysis suggests that shorter input lengths and text-only neural network models are more efficient in SSCD, while only slightly decreasing effectiveness. The evaluation results suggest that SSCD is more effective than state-of-the-art approaches like SAGA and SourcererCC. It is also highly efficient: in its optimal setting, SSCD effectively locates clones in the entire 320 million LOC BigCloneBench (a standard clone detection benchmark) in just under three hours.
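The embed-then-search idea described in the abstract can be illustrated with a minimal sketch. This is not SSCD's implementation: the function names (`normalize`, `cosine`, `find_clone_candidates`), the parameters `k` and `threshold`, and the toy three-dimensional vectors are all assumptions made for illustration. In a real pipeline the vectors would come from a BERT-style encoder run once per fragment, and the exhaustive loop below would be replaced by a parallel, GPU-accelerated nearest neighbour index. The point is only that the expensive model is applied once per fragment rather than once per pair, and the remaining comparison is a cheap vector operation.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]


def normalize(v: Vector) -> Vector:
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity of two unit-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def find_clone_candidates(
    embeddings: Dict[str, Vector], k: int, threshold: float
) -> List[Tuple[str, str, float]]:
    """Report (fragment, neighbour, similarity) for each fragment's k nearest
    neighbours whose similarity reaches the threshold.

    Assumption: exhaustive search stands in for an approximate or
    GPU-accelerated index; k and threshold are hypothetical tuning knobs.
    """
    unit = {name: normalize(vec) for name, vec in embeddings.items()}
    found: Dict[Tuple[str, str], float] = {}
    for name, vec in unit.items():
        scored = sorted(
            ((cosine(vec, other), other_name)
             for other_name, other in unit.items() if other_name != name),
            reverse=True,
        )
        for score, other_name in scored[:k]:
            if score >= threshold:
                first, second = sorted((name, other_name))
                found[(first, second)] = score  # de-duplicate symmetric hits
    return sorted((a, b, s) for (a, b), s in found.items())


# Toy embeddings (hypothetical values): frag_a and frag_b point in nearly
# the same direction, frag_c does not.
toy_embeddings = {
    "frag_a": [1.0, 0.0, 0.1],
    "frag_b": [0.9, 0.1, 0.1],
    "frag_c": [0.0, 1.0, 0.0],
}
candidates = find_clone_candidates(toy_embeddings, k=1, threshold=0.9)
```

With these toy vectors, only the `frag_a`/`frag_b` pair clears the threshold. Lowering `threshold` or raising `k` trades precision for recall, which is the kind of configuration choice the abstract says the paper evaluates.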

