Source code clone detection is the task of finding code fragments that implement the same or similar functionality but may differ in syntax or structure. The task matters for software maintenance, reuse, and quality assurance (Roy et al. 2009). It is also challenging, because source code is written in different languages, for different domains, and in different styles. In this paper, we argue that source code is inherently a graph rather than a sequence, and that graph-based methods are therefore better suited to code clone detection than sequence-based methods. We compare two state-of-the-art models, the sequence-based CodeBERT (Feng et al. 2020) and the graph-based CodeGraph (Yu et al. 2023), on two benchmark datasets: BCB (Svajlenko et al. 2014) and PoolC (PoolC n.d.). CodeGraph outperforms CodeBERT on both datasets, especially on cross-lingual code clones. To the best of our knowledge, this is the first work to demonstrate that graph-based methods outperform sequence-based methods on cross-lingual code clone detection.
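To make the graph-versus-sequence argument concrete, the following minimal sketch contrasts two views of the same pair of fragments. It is an illustration only and does not reproduce how CodeBERT or CodeGraph work: it assumes Python fragments, uses a bag-of-tokens overlap as a crude stand-in for a sequence-based view, and uses the multiset of parent-child node-type edges in the abstract syntax tree as a crude stand-in for a graph-based view. The helper names (`token_sequence`, `ast_edges`, `jaccard`) and both fragments are hypothetical.

```python
import ast
import io
import tokenize
from collections import Counter

# Token types that carry no lexical content of their own.
_SKIP = {
    tokenize.NEWLINE, tokenize.NL, tokenize.INDENT,
    tokenize.DEDENT, tokenize.ENDMARKER, tokenize.COMMENT,
}


def token_sequence(src: str) -> list[str]:
    """Sequence view: the lexical tokens of `src` in source order."""
    reader = io.StringIO(src).readline
    return [t.string for t in tokenize.generate_tokens(reader)
            if t.type not in _SKIP]


def ast_edges(src: str) -> Counter:
    """Graph view: multiset of (parent type, child type) AST edges."""
    edges = Counter()
    for parent in ast.walk(ast.parse(src)):
        for child in ast.iter_child_nodes(parent):
            edges[(type(parent).__name__, type(child).__name__)] += 1
    return edges


def jaccard(a: Counter, b: Counter) -> float:
    """Multiset Jaccard similarity in [0, 1]."""
    union = sum((a | b).values())
    return sum((a & b).values()) / union if union else 1.0


# Two hypothetical fragments with the same behavior but different identifiers.
frag_a = "def total(xs):\n    s = 0\n    for x in xs:\n        s += x\n    return s\n"
frag_b = ("def accumulate(values):\n    acc = 0\n    for v in values:\n"
          "        acc += v\n    return acc\n")

tok_sim = jaccard(Counter(token_sequence(frag_a)), Counter(token_sequence(frag_b)))
graph_sim = jaccard(ast_edges(frag_a), ast_edges(frag_b))
print(f"token overlap: {tok_sim:.2f}, AST edge overlap: {graph_sim:.2f}")
```

Because the two fragments differ only in identifier names, their AST edge multisets coincide while their token multisets do not. This is a toy case of renamed identifiers; it says nothing about the harder cross-lingual setting evaluated in the paper.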