Existing code similarity metrics, such as BLEU, CodeBLEU, and TSED, largely rely on surface-level string overlap or abstract syntax tree structures, and often fail to capture deeper semantic relationships between programs. We propose CSSG (Code Similarity using Semantic Graphs), a novel metric that leverages program dependence graphs to explicitly model control dependencies and variable interactions, providing a semantics-aware representation of code. Experiments on the CodeContests+ dataset show that CSSG consistently outperforms existing metrics in distinguishing more similar code from less similar code under both monolingual and cross-lingual settings, demonstrating that dependency-aware graph representations offer a more effective alternative to surface-level or syntax-based similarity measures.
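To make the contrast with surface-level metrics concrete, the following is a minimal sketch of a dependency-aware comparison. It is not the CSSG algorithm, whose details are not given here: the function names `dependence_edges` and `edge_similarity`, the edge labeling by statement type, and the multiset Jaccard score are all assumptions chosen for illustration. The sketch handles only a small subset of Python and approximates data dependence by the most recent assignment in source order.

```python
import ast
from collections import Counter


def dependence_edges(source):
    """Collect toy control/data dependence edges, labeled by statement type.

    Hypothetical helper for illustration only: variable names are discarded,
    so renaming identifiers does not change the result.
    """
    edges = Counter()
    last_def = {}  # variable name -> type of the statement that last assigned it

    def names(node, ctx):
        return {n.id for n in ast.walk(node)
                if isinstance(n, ast.Name) and isinstance(n.ctx, ctx)}

    def visit(stmts, controller):
        for stmt in stmts:
            kind = type(stmt).__name__
            if controller is not None:
                edges[("control", controller, kind)] += 1
            if isinstance(stmt, ast.FunctionDef):
                for a in stmt.args.args:
                    last_def[a.arg] = "arg"
                visit(stmt.body, kind)
                continue
            # For compound statements, only the header expression counts as a use.
            if isinstance(stmt, (ast.If, ast.While)):
                used, defined = names(stmt.test, ast.Load), set()
            elif isinstance(stmt, ast.For):
                used = names(stmt.iter, ast.Load)
                defined = names(stmt.target, ast.Store)
            else:
                used, defined = names(stmt, ast.Load), names(stmt, ast.Store)
            for name in used:
                if name in last_def:
                    edges[("data", last_def[name], kind)] += 1
            for name in defined:
                last_def[name] = kind
            if isinstance(stmt, (ast.If, ast.While, ast.For)):
                visit(stmt.body, kind)
                visit(stmt.orelse, kind)

    visit(ast.parse(source).body, None)
    return edges


def edge_similarity(src_a, src_b):
    """Multiset Jaccard similarity over the toy dependence edges (an assumed score)."""
    a, b = dependence_edges(src_a), dependence_edges(src_b)
    union = sum((a | b).values())
    return sum((a & b).values()) / union if union else 1.0


original = """
def f(xs):
    total = 0
    for x in xs:
        if x > 0:
            total = total + x
    return total
"""

# Same logic with every identifier renamed.
renamed = """
def g(values):
    acc = 0
    for v in values:
        if v > 0:
            acc = acc + v
    return acc
"""

# A structurally different program.
different = """
def h(xs):
    return len(xs)
"""

print(edge_similarity(original, renamed))
print(edge_similarity(original, different))
```

Because the edges record which kinds of statements depend on which, rather than the tokens themselves, the renamed variant scores as identical to the original while the structurally different program scores much lower. A string-overlap metric would penalize the renaming.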