Deep Learning (DL) is increasingly used in clone detection, motivated by the near-perfect performance it achieves on this task. Particularly for semantic code clones, which share only limited syntax but implement the same or similar functionality, DL appears to outperform conventional tools. In this paper, we investigate the generalizability of DL-based clone detectors for Java. To this end, we replicate and evaluate five state-of-the-art DL-based clone detectors, including Transformers like CodeBERT and single-task models like FA-AST+GMN, in a zero-shot evaluation scenario in which we train or fine-tune on one dataset and set of functionalities and evaluate on different ones. Our experiments demonstrate that the models' generalizability to unseen code is limited. Further analysis reveals that the conventional clone detector NiCad even outperforms the DL-based clone detectors in this zero-shot evaluation scenario.