Code clone detection plays a critical role in software maintenance and vulnerability analysis. Numerous methods have been proposed to detect code clones. However, they struggle to extract high-level program semantics directly from a single linear token sequence, which leads to unsatisfactory detection performance. A similar single-sequence challenge has been successfully addressed in protein structure prediction by AlphaFold. Motivated by this success, and by the sequential similarities between proteins and code, we adapt AlphaFold to code clone detection. In particular, we propose AlphaCC, which represents code fragments as token sequences and adapts AlphaFold's sequence-to-structure modeling capability to infer code semantics. The AlphaCC pipeline consists of three steps. First, AlphaCC transforms each input code fragment into a token sequence and, motivated by AlphaFold's use of multiple sequence alignment (MSA), uses a novel retrieval-augmentation strategy to construct an MSA from lexically similar token sequences. Second, AlphaCC adopts a modified attention-based encoder, derived from AlphaFold, to model dependencies within and across token sequences. Finally, unlike AlphaFold, whose task is protein structure prediction, AlphaCC computes similarity scores between token sequences through a late interaction strategy and performs binary classification to identify code clone pairs. Comprehensive evaluations on three datasets, in particular two semantic clone detection datasets, show that AlphaCC consistently outperforms all baselines, demonstrating strong semantic understanding. AlphaCC also performs well on instances where tool-dependent methods fail, highlighting its tool independence. Moreover, AlphaCC maintains competitive efficiency, enabling practical use in large-scale clone detection tasks.
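To make the first and third pipeline steps more concrete, the following is a minimal Python sketch, not the authors' implementation. It rests on several assumptions that the abstract does not confirm: lexical similarity is approximated by Jaccard overlap of token sets, late interaction is taken to be a MaxSim-style score (each token embedding is matched to its most similar counterpart, and the matches are averaged), and the final decision is a simple threshold. All function names and the threshold value are hypothetical. The attention-based encoder of the second step is omitted, so the token embeddings are assumed to be given.

```python
import math
import re
from typing import List, Sequence

Vector = Sequence[float]


def tokenize(code: str) -> List[str]:
    """Split code into identifiers, integers, and single punctuation marks (assumed scheme)."""
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", code)


def jaccard(a: Sequence[str], b: Sequence[str]) -> float:
    """Lexical similarity as the overlap of two token sets (an assumption)."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def retrieve_similar(query: List[str], corpus: List[List[str]], k: int) -> List[List[str]]:
    """Return the k corpus sequences most lexically similar to the query.

    This stands in for the retrieval step that gathers sequences for the MSA.
    """
    ranked = sorted(corpus, key=lambda seq: jaccard(query, seq), reverse=True)
    return ranked[:k]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def late_interaction_score(a: Sequence[Vector], b: Sequence[Vector]) -> float:
    """MaxSim-style score: match each token in `a` to its best token in `b`, then average."""
    if not a or not b:
        return 0.0
    return sum(max(cosine(u, v) for v in b) for u in a) / len(a)


def is_clone(a: Sequence[Vector], b: Sequence[Vector], threshold: float = 0.8) -> bool:
    """Binary decision from the symmetrised score (hypothetical threshold)."""
    score = 0.5 * (late_interaction_score(a, b) + late_interaction_score(b, a))
    return score >= threshold
```

Because the score is computed per token and only then aggregated, two fragments whose tokens appear in a different order can still score highly, which is one reason a late interaction design may suit clones that are semantically equivalent but textually different. In practice the threshold would be replaced by a trained classifier, as the abstract describes.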