Context: Code Clone Detection (CCD) is a software engineering task used for plagiarism detection, code search, and code comprehension. Recently, deep learning-based models have achieved an F1 score (a metric used to assess classifiers) of $\sim$95\% on the CodeXGLUE benchmark. These models require large amounts of training data and are mainly fine-tuned on Java or C++ datasets. However, no previous study has evaluated how well these models generalize when only a limited amount of annotated data is available. Objective: The main objective of this research is to assess how well CCD models, as well as few-shot learning algorithms, handle unseen programming problems and new languages (i.e., problems and languages the model was not trained on). Method: We assess the generalizability of state-of-the-art CCD models in few-shot settings (i.e., only a few samples are available for fine-tuning) under three scenarios: i) unseen problems, ii) unseen languages, and iii) a combination of new languages and new problems. We use three datasets (BigCloneBench, POJ-104, and CodeNet) and three languages (Java, C++, and Ruby). We then employ Model-Agnostic Meta-Learning (MAML), in which a meta-learner is trained to extract transferable knowledge from the training set so that the model can be fine-tuned using only a few samples. Finally, we combine contrastive learning with MAML to study whether it further improves the results of MAML.
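The MAML idea described in the Method can be illustrated with a minimal sketch. It is not the clone detection setup of this work: it assumes a toy scalar model y_hat = w * x and hypothetical tasks of the form y = a * x, chosen only because the second-order term of the meta-gradient can then be written by hand. All function names, the task sampler, and the hyperparameters (alpha, beta, steps) are assumptions made for illustration.

```python
import random

def mse(w, data):
    """Mean squared error of the toy model y_hat = w * x."""
    return sum((w * x - y) ** 2 for x, y in data) / len(data)

def grad(w, data):
    """First derivative of the MSE with respect to w."""
    return sum(2 * x * (w * x - y) for x, y in data) / len(data)

def hess(data):
    """Second derivative of the MSE (constant in w for this model)."""
    return sum(2 * x * x for x, _ in data) / len(data)

def sample_task(rng, k_shot=5):
    """Hypothetical task: y = a * x with a random slope a.
    Returns a (support, query) pair of k_shot samples each."""
    a = rng.uniform(1.0, 3.0)
    def draw():
        xs = [rng.uniform(-1.0, 1.0) for _ in range(k_shot)]
        return [(x, a * x) for x in xs]
    return draw(), draw()

def adapt(w, support, alpha):
    """Inner loop: one gradient step on the few support samples."""
    return w - alpha * grad(w, support)

def maml_train(steps=500, tasks_per_step=8, alpha=0.1, beta=0.05, seed=0):
    """Outer loop: update the initialization w so that one inner step
    on a new task gives a low loss on that task's query set."""
    rng = random.Random(seed)
    w = 0.0
    for _ in range(steps):
        meta_grad = 0.0
        for _ in range(tasks_per_step):
            support, query = sample_task(rng)
            w_adapted = adapt(w, support, alpha)
            # Chain rule through the inner step:
            # d(w_adapted)/dw = 1 - alpha * hess(support)
            meta_grad += grad(w_adapted, query) * (1.0 - alpha * hess(support))
        w -= beta * meta_grad / tasks_per_step
    return w

if __name__ == "__main__":
    w0 = maml_train()
    support, query = sample_task(random.Random(123))
    print("meta-learned init:", w0)
    print("query loss before adaptation:", mse(w0, query))
    print("query loss after one step:", mse(adapt(w0, support, 0.1), query))
```

In a realistic few-shot CCD setting the scalar `w` would be the parameters of a pretrained code model and the derivatives would come from an automatic differentiation library, but the two-level structure (adapt on a small support set, then update the initialization from the query loss) would stay the same.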