We consider clone detection and information retrieval for source code, two well-known tasks relevant to any programming language. Finding code snippets that behave identically but are written in different programming languages is an equally important and interesting problem, yet to the best of our knowledge multilingual clone detection has not been studied in the literature. In this work, we formulate the multilingual clone detection problem and present XCD, a new benchmark dataset built from the CodeForces submissions dataset. Moreover, we present a novel training procedure, cross-consistency training (CCT), which we apply to train language models on source code in different programming languages. The resulting CCT-LM model, initialized with GraphCodeBERT and fine-tuned with CCT, achieves a new state of the art, outperforming existing approaches on the POJ-104 clone detection benchmark with 95.67\% MAP and on the AdvTest code search benchmark with 47.18\% MRR; it also shows the best results on the newly created multilingual clone detection benchmark XCD across all programming languages.