As Large Language Models (LLMs) for code increasingly rely on massive, often non-permissively licensed datasets, evaluating data contamination through Membership Inference Attacks (MIAs) has become critical. We propose SERSEM (Selective Entropy-Weighted Scoring for Membership Inference), a novel white-box attack framework that suppresses uninformative syntactic boilerplate to amplify specific memorization signals. SERSEM uses a dual-signal methodology. First, a continuous character-level weight mask is derived through static Abstract Syntax Tree (AST) analysis, spellchecking-based multilingual logic detection, and offline linting. Second, these heuristic weights are used to pool internal transformer activations and to calibrate token-level Z-scores computed from the output logits. Evaluated on a balanced dataset of 25,000 samples, SERSEM achieves a global AUC-ROC of 0.7913 on the StarCoder2-3B model and 0.7867 on the StarCoder2-7B model, consistently outperforming the implemented probability-based baselines (Loss, Min-K% Prob, and PAC). Our findings demonstrate that focusing on human-centric coding anomalies provides a significantly more robust indicator of verbatim memorization than sequence-level probability averages.
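The weighted scoring idea described above can be illustrated with a minimal sketch. The abstract does not give SERSEM's exact formulas, so everything here is an assumption made for illustration: the Z-score is taken relative to the mean and standard deviation of log-probabilities under the model's own next-token distribution, the weights are assumed to be already mapped from characters to tokens, and the function names `token_z_scores`, `weighted_score`, and `auc_roc` are hypothetical. The activation-pooling signal is omitted.

```python
import math


def token_z_scores(logits, token_ids):
    """Z-score of each observed token's log-probability.

    Assumption: the score is normalized by the mean and standard deviation
    of log-probabilities under the next-token distribution at that position.
    `logits` is a list of per-position logit lists; `token_ids` are the
    indices of the tokens that actually occurred.
    """
    scores = []
    for row, tok in zip(logits, token_ids):
        # Numerically stable log-softmax over the vocabulary.
        m = max(row)
        log_norm = m + math.log(sum(math.exp(x - m) for x in row))
        log_probs = [x - log_norm for x in row]
        probs = [math.exp(lp) for lp in log_probs]
        mu = sum(p * lp for p, lp in zip(probs, log_probs))
        var = sum(p * (lp - mu) ** 2 for p, lp in zip(probs, log_probs))
        scores.append((log_probs[tok] - mu) / math.sqrt(var + 1e-12))
    return scores


def weighted_score(z_scores, weights):
    """Weighted mean of token scores; low-weight (boilerplate) tokens count less."""
    total = sum(weights)
    if total == 0:
        return 0.0
    return sum(w * z for w, z in zip(weights, z_scores)) / total


def auc_roc(member_scores, nonmember_scores):
    """AUC-ROC as the probability that a member outscores a non-member (ties count half)."""
    wins = 0.0
    for m in member_scores:
        for n in nonmember_scores:
            if m > n:
                wins += 1.0
            elif m == n:
                wins += 0.5
    return wins / (len(member_scores) * len(nonmember_scores))
```

In this sketch, a weight near zero on boilerplate tokens (for example, imports or braces) leaves the sample score dominated by the tokens the mask marks as informative, which is the intuition the abstract describes. The pairwise `auc_roc` is quadratic in the number of samples and is meant only to make the metric concrete.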