We consider the well-known and important tasks of clone detection and information retrieval for source code. The most common setup is to search for clones among code snippets written in the same programming language, but it is also useful to find snippets with identical behaviour across different languages. Nevertheless, multi- and cross-lingual clone detection has received little attention in the literature. We present cross-consistency training (CCT), a novel training procedure that leverages cross-lingual similarity, and apply it to train language models on source code in multiple programming languages. We show that this training is effective for both encoder- and decoder-based models. The trained encoder-based CCT-LM model achieves a new state of the art on POJ-104 (a monolingual C++ clone detection benchmark) with 96.73\% MAP and on AdvTest (a monolingual Python code search benchmark) with 47.18\% MRR. The decoder-based CCT-LM model shows comparable performance on these tasks. In addition, we formulate the multi- and cross-lingual clone detection problem and present XCD, a new benchmark dataset built from CodeForces submissions.