The Hidden DNA of LLM-Generated JavaScript: Structural Patterns Enable High-Accuracy Authorship Attribution

In this paper, we present the first large-scale study of whether JavaScript code generated by Large Language Models (LLMs) can reveal which model produced it, enabling reliable authorship attribution and model fingerprinting. With the rapid rise of AI-generated code, attribution plays a critical role in detecting vulnerabilities, flagging malicious content, and ensuring accountability. While AI-vs-human detection usually treats AI as a single category, we show that individual LLMs leave unique stylistic signatures, even among models of the same family or parameter size. To this end, we introduce LLM-NodeJS, a dataset of 50,000 Node.js back-end programs from 20 large language models. Each program has four transformed variants, yielding 250,000 unique JavaScript samples, along with two additional representations (JSIR and AST) for diverse research applications. Using this dataset, we benchmark traditional machine learning classifiers against fine-tuned Transformer encoders and introduce CodeT5-JSA, a custom architecture derived from the 770M-parameter CodeT5 model with its decoder removed and a modified classification head. It achieves 95.8% accuracy on five-class attribution, 94.6% on ten-class, and 88.5% on twenty-class tasks, surpassing the other tested models, including BERT, CodeBERT, and Longformer. We demonstrate that the classifiers capture deeper stylistic regularities in program dataflow and structure rather than relying on surface-level features. As a result, attribution remains effective even after mangling, comment removal, and heavy code transformations. To support open science and reproducibility, we release the LLM-NodeJS dataset, Google Colab training scripts, and all related materials on GitHub: https://github.com/LLM-NodeJS-dataset.
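To put the reported numbers in context, the following is a minimal Python sketch of the arithmetic behind them. It assumes that the 250,000 figure counts each original program together with its four variants, and that classes are balanced so that random guessing scores 1/k on a k-class task; neither assumption is stated explicitly in the abstract, and the function names are illustrative only.

```python
# Figures quoted in the abstract.
BASE_PROGRAMS = 50_000          # Node.js back-end programs
VARIANTS_PER_PROGRAM = 4        # transformed variants of each program
REPORTED_ACCURACY = {5: 0.958, 10: 0.946, 20: 0.885}  # classes -> accuracy


def total_samples(base: int, variants: int) -> int:
    """Count samples, assuming each original is kept alongside its variants."""
    return base * (1 + variants)


def chance_accuracy(num_classes: int) -> float:
    """Random-guess accuracy, assuming balanced classes (an assumption)."""
    return 1.0 / num_classes


def lift_over_chance(accuracy: float, num_classes: int) -> float:
    """How many times better than random guessing a reported accuracy is."""
    return accuracy / chance_accuracy(num_classes)


if __name__ == "__main__":
    print("total samples:", total_samples(BASE_PROGRAMS, VARIANTS_PER_PROGRAM))
    for k, acc in REPORTED_ACCURACY.items():
        print(f"{k}-class: accuracy {acc:.1%}, "
              f"chance {chance_accuracy(k):.1%}, "
              f"lift {lift_over_chance(acc, k):.1f}x")
```

Under these assumptions, the raw accuracy falls as the number of candidate models grows, but the margin over random guessing widens, since the chance baseline shrinks faster than the accuracy does.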

