This paper investigates source code similarity detection using a transformer model augmented with an execution-derived signal. We extend GraphCodeBERT with an explicit, low-dimensional behavioral feature that captures observable agreement between code fragments, and fuse this signal with the pooled transformer representation through a trainable output head. Behavioral agreement is computed by comparing outputs under a fixed test suite, and this observed agreement serves as an operational approximation of semantic similarity between code pairs. The resulting feature acts as an explicit behavioral signature that complements token- and graph-based representations. Experiments on established clone detection benchmarks show consistent improvements in precision, recall, and F$_1$ over the unmodified GraphCodeBERT baseline, with the largest gains on semantically equivalent but syntactically dissimilar pairs. The source code illustrating our approach is available at https://www.github.com/jorge-martinez-gil/graphcodebert-feature-integration.
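The two ingredients the abstract names, an execution-derived agreement score over a fixed test suite and its fusion with the pooled transformer representation through a trainable output head, can be sketched as follows. This is a minimal, dependency-free illustration, not the paper's implementation: the function names, the tiny three-dimensional stand-in for GraphCodeBERT's pooled embedding, and the single linear scoring layer are all assumptions made for exposition.

```python
import math

def behavioral_agreement(func_a, func_b, test_inputs):
    """Execution-derived signal (sketch): the fraction of test inputs on
    which both fragments produce equal outputs under a fixed test suite.
    The exact test-suite protocol is an assumption."""
    matches = sum(func_a(x) == func_b(x) for x in test_inputs)
    return matches / len(test_inputs)

def fuse(pooled_embedding, agreement, weights, bias):
    """Output-head sketch: concatenate the pooled transformer representation
    with the scalar behavioral feature, then apply a linear scoring layer
    and a sigmoid. In practice the weights would be trained end to end;
    the single-layer architecture here is an assumption."""
    features = pooled_embedding + [agreement]
    z = sum(w * f for w, f in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-z))  # similarity score in (0, 1)

# Two syntactically dissimilar but semantically equivalent fragments:
# the abstract's motivating case for the behavioral feature.
sum_iterative = lambda n: sum(range(n + 1))
sum_closed_form = lambda n: n * (n + 1) // 2

tests = [0, 1, 5, 10, 100]
agree = behavioral_agreement(sum_iterative, sum_closed_form, tests)
print(agree)  # 1.0: outputs match on every test input

# Toy 3-dim stand-in for the 768-dim pooled GraphCodeBERT embedding.
pooled = [0.2, -0.1, 0.3]
score = fuse(pooled, agree, weights=[0.1, 0.1, 0.1, 0.1], bias=0.0)
print(score)  # a probability-like similarity score between 0 and 1
```

In a real pipeline the agreement score would be appended to the `[CLS]` pooled vector before the classification head, so the head can learn how much weight to give the behavioral signature relative to the learned representation.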