Large-scale source-code clone detection is a challenging task. In our previous work, we proposed an approach (SSCD) that leverages artificial neural networks and approximate nearest neighbour search to locate clones in large bodies of code effectively and in a time-efficient manner. However, our literature review suggests that the relative efficacy of different neural network models has not been assessed in the context of large-scale clone detection approaches. In this work, we assess several such models individually, in terms of their potential to maximize recall while preserving a high level of precision during clone detection. We also investigate whether ensemble inference (in this case, using the results of more than one of these neural network models in combination) can further assist in this task. To this end, we employed four state-of-the-art neural network models and evaluated them individually and in combination. The results, on an illustrative dataset of approximately 500K lines of C/C++ code, suggest that ensemble inference outperforms individual models in all trialled cases where recall is concerned. Of the individual models, the ADA model (belonging to the ChatGPT family of models) has the best performance. However, commercial companies may not be prepared to hand their proprietary source code over to the cloud, as required by that approach. Consequently, they may be more interested in an ensemble combination of CodeBERT-based and CodeT5 models, which yields similar (if slightly lower) recall and precision.
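To make the idea of ensemble inference concrete, the following is a minimal sketch and not the SSCD implementation. It assumes that each model maps a code fragment to an embedding vector, that a fragment counts as a clone candidate when its cosine similarity to the query meets a threshold, and that the ensemble simply takes the union of the candidates reported by each model. The abstract does not state how results are combined, so the union rule, the threshold value, the model names, and the toy vectors are all illustrative assumptions. A brute-force scan stands in for approximate nearest neighbour search.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def clone_candidates(query_vec, corpus_vecs, threshold):
    """Ids of corpus fragments whose similarity to the query meets the threshold.

    Brute-force scan; a real system would use an approximate
    nearest neighbour index instead (assumption for illustration).
    """
    return {
        frag_id
        for frag_id, vec in corpus_vecs.items()
        if cosine(query_vec, vec) >= threshold
    }


def ensemble_candidates(query_by_model, corpus_by_model, threshold=0.9):
    """Union of the candidates found by each model (assumed combination rule)."""
    result = set()
    for model, query_vec in query_by_model.items():
        result |= clone_candidates(query_vec, corpus_by_model[model], threshold)
    return result


# Hypothetical 2-D embeddings of three fragments under two hypothetical models.
corpus_by_model = {
    "model_a": {"f1": [1.0, 0.0], "f2": [0.0, 1.0], "f3": [0.7, 0.7]},
    "model_b": {"f1": [0.0, 1.0], "f2": [1.0, 0.0], "f3": [0.6, 0.8]},
}
# Embeddings of the same query fragment under each model.
query_by_model = {"model_a": [1.0, 0.1], "model_b": [0.9, 0.1]}

only_a = clone_candidates(query_by_model["model_a"], corpus_by_model["model_a"], 0.9)
only_b = clone_candidates(query_by_model["model_b"], corpus_by_model["model_b"], 0.9)
combined = ensemble_candidates(query_by_model, corpus_by_model, 0.9)
print(only_a, only_b, combined)
```

Under a union rule, the ensemble can never retrieve fewer true clones than any single member, which is consistent with the recall gains reported above. The cost is that false positives from each model accumulate, so precision has to be checked separately.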