Code clone detection, which aims to retrieve functionally similar programs from large code bases, has been attracting increasing attention. Modern software often involves a diverse range of programming languages. However, current code clone detection methods are generally limited to a few popular programming languages, owing to insufficient annotated data and to constraints in their model design. To address these issues, we present AdaCCD, a novel cross-lingual adaptation method that can detect cloned code in a new language without annotations in that language. AdaCCD leverages language-agnostic code representations from pre-trained programming language models and proposes an Adaptively Refined Contrastive Learning framework to transfer knowledge from resource-rich languages to resource-poor languages. We evaluate the cross-lingual adaptation results of AdaCCD by constructing a multilingual code clone detection benchmark consisting of 5 programming languages. AdaCCD achieves significant improvements over other baselines and achieves performance comparable to supervised fine-tuning.
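To make the contrastive-learning idea concrete, the following is a minimal sketch of a generic InfoNCE-style objective over code embeddings, followed by clone retrieval via cosine similarity. It is not AdaCCD's Adaptively Refined Contrastive Learning, whose refinement procedure is not described in this abstract. The function names (`cosine`, `info_nce`, `retrieve`), the temperature value, and the toy two-dimensional vectors are all assumptions introduced for illustration; in practice the embeddings would come from a pre-trained programming language model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two embedding vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchor, positive, negatives, temperature=0.1):
    """Generic InfoNCE loss for one anchor (hypothetical helper).

    The loss is low when the anchor is closer to its positive (a clone)
    than to every negative (a non-clone).
    """
    pos = cosine(anchor, positive) / temperature
    logits = [pos] + [cosine(anchor, n) / temperature for n in negatives]
    # Numerically stable log-sum-exp over all candidates.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum - pos


def retrieve(query, corpus, k=1):
    """Return indices of the k corpus embeddings most similar to the query."""
    ranked = sorted(
        range(len(corpus)),
        key=lambda i: cosine(query, corpus[i]),
        reverse=True,
    )
    return ranked[:k]


# Toy embeddings (assumed values, not real model outputs).
anchor = [1.0, 0.0]
clone = [0.9, 0.1]
unrelated = [0.0, 1.0]

loss_good = info_nce(anchor, clone, [unrelated])   # clone treated as positive
loss_bad = info_nce(anchor, unrelated, [clone])    # wrong pair treated as positive
top = retrieve(anchor, [unrelated, clone], k=1)
```

Here `loss_good` is smaller than `loss_bad` because the clone embedding lies closer to the anchor, and `retrieve` ranks the clone first. Training against such an objective pulls embeddings of functionally similar programs together, which is what makes similarity-based retrieval workable in a language with no annotations.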