In this paper, we present the first large-scale study exploring whether JavaScript code generated by Large Language Models (LLMs) can reveal which model produced it, enabling reliable authorship attribution and model fingerprinting. With the rapid rise of AI-generated code, attribution plays a critical role in detecting vulnerabilities, flagging malicious content, and ensuring accountability. While AI-vs-human detection usually treats AI as a single category, we show that individual LLMs leave unique stylistic signatures, even among models from the same family or of the same parameter size. To this end, we introduce LLM-NodeJS, a dataset of 50,000 Node.js back-end programs from 20 large language models. Each program has four transformed variants, yielding 250,000 unique JavaScript samples, along with two additional representations (JSIR and AST) for diverse research applications. Using this dataset, we benchmark traditional machine learning classifiers against fine-tuned Transformer encoders and introduce CodeT5-JSA, a custom architecture derived from the 770M-parameter CodeT5 model with its decoder removed and a modified classification head. It achieves 95.8% accuracy on five-class attribution, 94.6% on ten-class, and 88.5% on twenty-class tasks, surpassing other tested models such as BERT, CodeBERT, and Longformer. We demonstrate that classifiers capture deeper stylistic regularities in program dataflow and structure, rather than relying on surface-level features. As a result, attribution remains effective even after mangling, comment removal, and heavy code transformations. To support open science and reproducibility, we release the LLM-NodeJS dataset, Google Colab training scripts, and all related materials on GitHub: https://github.com/LLM-NodeJS-dataset.
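The architectural change behind CodeT5-JSA can be sketched concisely: keep only the CodeT5 encoder stack and attach a classification head that maps pooled code representations to one logit per candidate LLM. The PyTorch sketch below is an illustrative assumption of that design, using the Hugging Face `Salesforce/codet5-large` checkpoint (the 770M-parameter CodeT5); the class name `CodeT5Attributor`, the mean-pooling strategy, and the head layout are hypothetical choices for exposition, not the paper's exact implementation.

```python
# Minimal sketch of an encoder-only CodeT5 attribution classifier.
# Assumptions (not from the paper): class name, mean pooling, head layout.
import torch
import torch.nn as nn
from transformers import AutoTokenizer, T5EncoderModel


class CodeT5Attributor(nn.Module):  # hypothetical name
    def __init__(self, num_models: int = 20,
                 checkpoint: str = "Salesforce/codet5-large"):
        super().__init__()
        # T5EncoderModel loads only the encoder stack, discarding the decoder.
        self.encoder = T5EncoderModel.from_pretrained(checkpoint)
        hidden = self.encoder.config.d_model
        # Assumed classification head: project the pooled representation to
        # one logit per candidate LLM (5-, 10-, or 20-way attribution).
        self.head = nn.Sequential(nn.Dropout(0.1), nn.Linear(hidden, num_models))

    def forward(self, input_ids, attention_mask):
        states = self.encoder(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state
        # Masked mean pooling over non-padding tokens.
        mask = attention_mask.unsqueeze(-1).float()
        pooled = (states * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)
        return self.head(pooled)


tokenizer = AutoTokenizer.from_pretrained("Salesforce/codet5-large")
model = CodeT5Attributor(num_models=20)
batch = tokenizer(["const http = require('http');"],
                  return_tensors="pt", truncation=True)
logits = model(**batch)  # shape: (1, 20) -- one score per candidate LLM
```

Dropping the decoder roughly halves the parameter count relative to the full sequence-to-sequence model, which suits classification, where no tokens need to be generated.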