This paper investigates source code similarity detection using a transformer model augmented with an execution-derived signal. We extend GraphCodeBERT with an explicit, low-dimensional behavioral feature that captures observable agreement between code fragments, and fuse this signal with the pooled transformer representation through a trainable output head. Behavioral agreement is computed by comparing outputs under a fixed test suite, and this observed agreement serves as an operational approximation of semantic similarity between code pairs. The resulting feature acts as an explicit behavioral signature that complements token- and graph-based representations. Experiments on established clone detection benchmarks show consistent improvements in precision, recall, and F$_1$ over the unmodified GraphCodeBERT baseline, with the largest gains on semantically equivalent but syntactically dissimilar pairs. The source code illustrating our approach is available at https://www.github.com/jorge-martinez-gil/graphcodebert-feature-integration.
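The two ingredients the abstract names, an execution-derived agreement score over a fixed test suite and its fusion with the pooled transformer representation through a trainable output head, can be sketched as follows. This is a minimal, dependency-free illustration, not the paper's implementation: the function names, the tiny three-dimensional stand-in for GraphCodeBERT's pooled embedding, and the single linear scoring layer are all assumptions made for exposition.

```python
import math

def behavioral_agreement(func_a, func_b, test_inputs):
    """Execution-derived signal (sketch): the fraction of test inputs on
    which both fragments produce equal outputs under a fixed test suite.
    The exact test-suite protocol is an assumption."""
    matches = sum(func_a(x) == func_b(x) for x in test_inputs)
    return matches / len(test_inputs)

def fuse(pooled_embedding, agreement, weights, bias):
    """Output-head sketch: concatenate the pooled transformer representation
    with the scalar behavioral feature, then apply a linear scoring layer
    and a sigmoid. In practice the weights would be trained end to end;
    the single-layer architecture here is an assumption."""
    features = pooled_embedding + [agreement]
    z = sum(w * f for w, f in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-z))  # similarity score in (0, 1)

# Two syntactically dissimilar but semantically equivalent fragments:
# the abstract's motivating case for the behavioral feature.
sum_iterative = lambda n: sum(range(n + 1))
sum_closed_form = lambda n: n * (n + 1) // 2

tests = [0, 1, 5, 10, 100]
agree = behavioral_agreement(sum_iterative, sum_closed_form, tests)
print(agree)  # 1.0: outputs match on every test input

# Toy 3-dim stand-in for the 768-dim pooled GraphCodeBERT embedding.
pooled = [0.2, -0.1, 0.3]
score = fuse(pooled, agree, weights=[0.1, 0.1, 0.1, 0.1], bias=0.0)
print(score)  # a probability-like similarity score between 0 and 1
```

In a real pipeline the agreement score would be appended to the `[CLS]` pooled vector before the classification head, so the head can learn how much weight to give the behavioral signature relative to the learned representation.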