Code clones can be detrimental to software maintenance, and detecting them manually in very large codebases is impractical. Automated approaches, in turn, struggle to detect Type 3 and Type 4 (inexact) clones. Although recent deep neural networks (for example, BERT-based models) appear highly effective at detecting such clones, they compare every pair of code fragments in the target system(s), which is inefficient and scales poorly to large codebases. We therefore introduce SSCD, a BERT-based clone detection approach that targets high recall of Type 3 and Type 4 clones at scale, in line with our industrial partner's requirements. SSCD computes a representative embedding for each code fragment and finds similar fragments using a nearest neighbour search. It thus avoids the pairwise-comparison bottleneck of other neural network approaches, and it uses parallel, GPU-accelerated search to address scalability. This paper details the approach and an empirical assessment aimed at configuring and evaluating it in an industrial setting. The configuration analysis suggests that shorter input lengths and text-only neural network models make SSCD more efficient while only slightly decreasing its effectiveness. The evaluation results suggest that SSCD is more effective than state-of-the-art approaches such as SAGA and SourcererCC. It is also highly efficient: in its optimal setting, SSCD locates clones in the entire 320 million LOC BigCloneBench (a standard clone detection benchmark) in just under three hours.
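The embed-then-search idea described above can be illustrated with a toy sketch. This is not SSCD's implementation: `toy_embed` is a hypothetical hashed bag-of-tokens stand-in for a BERT-based encoder, the exhaustive cosine scan stands in for the parallel, GPU-accelerated nearest neighbour search, and the fragment names and `DIM` are made up for illustration. The point it shows is only the structure: each fragment is encoded once, and similarity is then a cheap vector operation rather than a neural network pass over every code pair.

```python
import math
import re
import zlib

DIM = 1024  # hypothetical embedding size for the toy encoder


def toy_embed(code: str) -> list:
    """Stand-in for a learned encoder: hashed bag-of-tokens, L2-normalised."""
    vec = [0.0] * DIM
    # Split into identifiers, numbers, and single punctuation characters.
    for tok in re.findall(r"[A-Za-z_]\w*|\d+|\S", code):
        # crc32 is deterministic across runs, unlike the built-in hash().
        vec[zlib.crc32(tok.encode("utf-8")) % DIM] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def cosine(a: list, b: list) -> float:
    """Cosine similarity of two already-normalised vectors (a dot product)."""
    return sum(x * y for x, y in zip(a, b))


def nearest_neighbours(index: dict, query_id: str, k: int = 1) -> list:
    """Return the k most similar fragments as (similarity, fragment_id) pairs."""
    query = index[query_id]
    scored = [
        (cosine(query, vec), fid)
        for fid, vec in index.items()
        if fid != query_id
    ]
    scored.sort(reverse=True)
    return scored[:k]


# Illustrative fragments: the first two differ only in identifier names.
fragments = {
    "sum_loop": "int total = 0; for (int i = 0; i < n; i++) { total += a[i]; }",
    "sum_loop_renamed": "int sum = 0; for (int j = 0; j < n; j++) { sum += a[j]; }",
    "read_file": "BufferedReader r = new BufferedReader(new FileReader(path)); "
                 "String line = r.readLine();",
}

# Step 1: embed every fragment exactly once.
index = {fid: toy_embed(src) for fid, src in fragments.items()}

# Step 2: look up clone candidates by vector similarity.
best = nearest_neighbours(index, "sum_loop", k=1)
print(best)
```

In a realistic setting the encoder would be a trained model and the scan would be replaced by a dedicated nearest neighbour index; the token-hash encoder here captures only surface overlap and would not find Type 4 clones.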