Learning the Signature of Memorization in Autoregressive Language Models

All prior membership inference attacks for fine-tuned language models use hand-crafted heuristics (e.g., loss thresholding, Min-K\%, reference calibration), each bounded by the designer's intuition. We introduce the first transferable learned attack, enabled by the observation that fine-tuning any model on any corpus yields unlimited labeled data, since membership is known by construction. This removes the shadow model bottleneck and brings membership inference into the deep learning era: learning what matters rather than designing it, with generalization through training diversity and scale. We discover that fine-tuning language models produces an invariant signature of memorization detectable across architectural families and data domains. We train a membership inference classifier exclusively on transformer-based models. It transfers zero-shot to Mamba (state-space), RWKV-4 (linear attention), and RecurrentGemma (gated recurrence), achieving 0.963, 0.972, and 0.936 AUC, respectively. Each evaluation combines an architecture and a dataset never seen during training, yet all three exceed performance on held-out transformers (0.908 AUC). These four families share no computational mechanisms; their only commonality is gradient descent on cross-entropy loss. Even simple likelihood-based methods exhibit strong transfer, confirming that the signature exists independently of the detection method. Our method, Learned Transfer MIA (LT-MIA), captures this signal most effectively by reframing membership inference as sequence classification over per-token distributional statistics. On transformers, LT-MIA achieves 2.8$\times$ higher TPR at 0.1\% FPR than the strongest baseline. The method also transfers to code (0.865 AUC) despite being trained only on natural language text. Code and the trained classifier are available at https://github.com/JetBrains-Research/learned-mia.
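
To make the phrase "sequence classification over per-token distributional statistics" concrete, the following is a minimal sketch of how such an input sequence might be assembled from a model's next-token logits, alongside a Min-K\%-style baseline for comparison. The particular statistics (log-probability, entropy, and rank of the true token) and all function names are assumptions chosen for illustration; they are not taken from the LT-MIA code, and the actual feature set may differ.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one vocabulary distribution."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def token_features(logits: Sequence[float], target: int) -> List[float]:
    """Illustrative per-token statistics (an assumption, not the paper's set):
    log-prob of the true token, entropy of the distribution, rank of the true token."""
    logp = log_softmax(logits)
    entropy = -sum(math.exp(lp) * lp for lp in logp)
    # Rank 0 means no other token has strictly higher probability.
    rank = sum(1 for lp in logp if lp > logp[target])
    return [logp[target], entropy, float(rank)]


def sequence_features(
    logits_per_pos: Sequence[Sequence[float]], targets: Sequence[int]
) -> List[List[float]]:
    """One feature vector per position; a learned sequence classifier would
    consume this matrix and output a membership score."""
    return [token_features(l, t) for l, t in zip(logits_per_pos, targets)]


def min_k_score(token_logps: Sequence[float], k: float = 0.2) -> float:
    """Min-K%-style heuristic: mean of the lowest k fraction of token log-probs.
    Higher (less negative) scores suggest membership."""
    n = max(1, int(len(token_logps) * k))
    return sum(sorted(token_logps)[:n]) / n
```

In this sketch, a hand-crafted attack reduces the sequence to a single scalar such as `min_k_score`, whereas a learned attack would be trained on the full per-position feature matrix, with labels available because the fine-tuning set is known by construction.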
