Powerful Training-Free Membership Inference Against Autoregressive Language Models

Fine-tuned language models pose significant privacy risks, as they may memorize and expose sensitive information from their training data. Membership inference attacks (MIAs) provide a principled framework for auditing these risks, yet existing methods achieve limited detection rates, particularly at the low false-positive rates required for practical privacy auditing. We present EZ-MIA, a membership inference attack that exploits a key observation: memorization manifests most strongly at error positions, that is, tokens where the model predicts incorrectly yet still shows elevated probability for training examples. We introduce the Error Zone (EZ) score, which measures the directional imbalance of probability shifts at error positions relative to a pretrained reference model. This statistic requires only two forward passes per query and no model training of any kind. On WikiText with GPT-2, EZ-MIA achieves 3.8x higher detection than the previous state of the art under identical conditions (66.3% versus 17.5% true positive rate (TPR) at 1% false positive rate (FPR)), with near-perfect discrimination (AUC 0.98). At the stringent 0.1% FPR threshold critical for real-world auditing, it achieves 8x higher detection than prior work (14.0% versus 1.8% TPR). These gains extend to larger architectures: on AG News with Llama-2-7B, EZ-MIA achieves 3x higher detection (46.7% versus 15.8% TPR at 1% FPR). These results establish that the privacy risks of fine-tuned language models are substantially greater than previously understood, with implications for both privacy auditing and deployment decisions. Code is available at https://github.com/JetBrains-Research/ez-mia.
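
To make the two quantities in the abstract concrete, the sketch below shows one possible reading of an Error Zone-style score and of the TPR-at-fixed-FPR metric. It is not the authors' implementation: the abstract does not give the formula, so the normalized form `(up - down) / (up + down)`, the use of log-probability differences as the "shift", the top-1 definition of an error position, and the names `ez_style_score` and `tpr_at_fpr` are all assumptions made for illustration. The inputs are assumed to come from one forward pass of the fine-tuned model and one of the pretrained reference model.

```python
from typing import Sequence


def ez_style_score(
    target_logp: Sequence[float],
    ref_logp: Sequence[float],
    target_correct: Sequence[bool],
) -> float:
    """Hypothetical Error Zone-style score for a single sequence.

    target_logp[i]    : log-prob of the true token i under the fine-tuned model
    ref_logp[i]       : log-prob of the same token under the pretrained reference
    target_correct[i] : True if the fine-tuned model's top-1 prediction at
                        position i equals the true token (assumed definition)
    Returns a value in [-1, 1]; larger means shifts at error positions lean upward.
    """
    up = 0.0    # total upward shift at error positions
    down = 0.0  # total downward shift at error positions
    for t, r, ok in zip(target_logp, ref_logp, target_correct):
        if ok:
            continue  # only error positions contribute
        shift = t - r
        if shift > 0:
            up += shift
        else:
            down += -shift
    total = up + down
    if total == 0.0:
        return 0.0  # no error positions, or no shift at any of them
    return (up - down) / total


def tpr_at_fpr(
    member_scores: Sequence[float],
    nonmember_scores: Sequence[float],
    fpr: float,
) -> float:
    """Fraction of members flagged when at most `fpr` of non-members are flagged."""
    if not member_scores or not nonmember_scores:
        raise ValueError("both score lists must be non-empty")
    ranked = sorted(nonmember_scores, reverse=True)
    allowed = int(fpr * len(ranked))  # number of false positives we may tolerate
    # Flag a sample only if it scores strictly above this threshold, so at most
    # `allowed` non-members are flagged even when scores tie.
    threshold = ranked[allowed] if allowed < len(ranked) else float("-inf")
    hits = sum(1 for s in member_scores if s > threshold)
    return hits / len(member_scores)
```

In use, one would compute `ez_style_score` for every candidate sequence, then call `tpr_at_fpr(member_scores, nonmember_scores, 0.01)` to obtain a figure comparable in kind to the "TPR at 1% FPR" numbers reported above. Note that estimating TPR at 0.1% FPR with this function needs at least 1,000 non-member scores, since fewer would force the tolerated false-positive count to zero.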
