Evaluating Few-Shot and Contrastive Learning Methods for Code Clone Detection

Context: Code Clone Detection (CCD) is a software engineering task used for plagiarism detection, code search, and code comprehension. Recently, deep learning-based models have achieved an F1 score (a metric used to assess classifiers) of about 95% on the CodeXGLUE benchmark. These models require large amounts of training data and are mainly fine-tuned on Java or C++ datasets. However, no previous study has evaluated how well these models generalize when only a limited amount of annotated data is available. Objective: The main objective of this research is to assess how well CCD models and few-shot learning algorithms handle unseen programming problems and new languages (i.e., problems and languages the model was not trained on). Method: We assess the generalizability of state-of-the-art CCD models in few-shot settings (i.e., only a few samples are available for fine-tuning) under three scenarios: i) unseen problems, ii) unseen languages, and iii) a combination of new languages and new problems. We use three datasets (BigCloneBench, POJ-104, and CodeNet) and three languages (Java, C++, and Ruby). We then employ Model-Agnostic Meta-Learning (MAML), in which the model learns a meta-learner capable of extracting transferable knowledge from the training set, so that the model can be fine-tuned using only a few samples. Finally, we combine contrastive learning with MAML to study whether it can further improve the results of MAML.
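To make the MAML step in the Method more concrete, here is a minimal sketch of the inner (task adaptation) and outer (meta-update) loops. It rests on several assumptions that are not in the abstract. It uses the first-order approximation of MAML rather than full second-order MAML. It replaces a pre-trained code model with a logistic-regression classifier over made-up 4-dimensional "code pair" features. Each toy task stands in for one programming problem. All function names (`loss_and_grad`, `adapt`, `fomaml`, `make_task`) and all hyperparameters are hypothetical. The contrastive learning component is not covered.

```python
import math
import random

def loss_and_grad(w, batch):
    """Mean logistic loss and its gradient over (features, label) pairs."""
    loss, grad = 0.0, [0.0] * len(w)
    for x, y in batch:
        z = sum(wi * xi for wi, xi in zip(w, x))
        p = 1.0 / (1.0 + math.exp(-z))
        p = min(max(p, 1e-12), 1.0 - 1e-12)  # guard against log(0)
        loss -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
        for i, xi in enumerate(x):
            grad[i] += (p - y) * xi
    n = len(batch)
    return loss / n, [g / n for g in grad]

def adapt(w, support, inner_lr=0.1, steps=1):
    """Inner loop: fine-tune a copy of w on a task's few support samples."""
    for _ in range(steps):
        _, g = loss_and_grad(w, support)
        w = [wi - inner_lr * gi for wi, gi in zip(w, g)]
    return w

def fomaml(tasks, dim, meta_lr=0.05, epochs=50):
    """Outer loop (first-order approximation, an assumption of this sketch):
    move the shared initialization along the query-set gradient measured
    at each task's adapted weights."""
    w = [0.0] * dim
    for _ in range(epochs):
        meta_grad = [0.0] * dim
        for support, query in tasks:
            w_task = adapt(w, support)
            _, g = loss_and_grad(w_task, query)
            meta_grad = [m + gi for m, gi in zip(meta_grad, g)]
        w = [wi - meta_lr * m / len(tasks) for wi, m in zip(w, meta_grad)]
    return w

def make_task(rng, shared_rule, n_support=5, n_query=20):
    """Hypothetical toy task: a perturbed linear rule labels each feature
    vector as clone (1) or non-clone (0)."""
    rule = [r + rng.gauss(0, 0.3) for r in shared_rule]

    def sample(n):
        pairs = []
        for _ in range(n):
            x = [rng.gauss(0, 1) for _ in shared_rule]
            y = 1 if sum(r * xi for r, xi in zip(rule, x)) > 0 else 0
            pairs.append((x, y))
        return pairs

    return sample(n_support), sample(n_query)

rng = random.Random(0)
shared_rule = [rng.gauss(0, 1) for _ in range(4)]
train_tasks = [make_task(rng, shared_rule) for _ in range(8)]
w_meta = fomaml(train_tasks, dim=4)

# Few-shot fine-tuning on an unseen task, starting from the meta-learned weights.
support, query = make_task(rng, shared_rule)
w_new = adapt(w_meta, support, steps=5)
```

The split into a small support set and a larger query set mirrors the few-shot setting described above: the support set plays the role of the few samples available for fine-tuning, and the query set measures how well the adapted model transfers within that task.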

