Fine-tuned language models pose significant privacy risks, as they may memorize and expose sensitive information from their training data. Membership inference attacks (MIAs) provide a principled framework for auditing these risks, yet existing methods achieve limited detection rates, particularly at the low false-positive thresholds required for practical privacy auditing. We present EZ-MIA, a membership inference attack that exploits a key observation: memorization manifests most strongly at error positions, specifically tokens where the model predicts incorrectly yet still shows elevated probability for training examples. We introduce the Error Zone (EZ) score, which measures the directional imbalance of probability shifts at error positions relative to a pretrained reference model. This principled statistic requires only two forward passes per query and no model training of any kind. On WikiText with GPT-2, EZ-MIA achieves 3.8x higher detection than the previous state-of-the-art under identical conditions (66.3% versus 17.5% true positive rate at 1% false positive rate), with near-perfect discrimination (AUC 0.98). At the stringent 0.1% FPR threshold critical for real-world auditing, we achieve 8x higher detection than prior work (14.0% versus 1.8%), requiring no reference model training. These gains extend to larger architectures: on AG News with Llama-2-7B, we achieve 3x higher detection (46.7% versus 15.8% TPR at 1% FPR). These results establish that privacy risks of fine-tuned language models are substantially greater than previously understood, with implications for both privacy auditing and deployment decisions. Code is available at https://github.com/JetBrains-Research/ez-mia.
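The abstract does not give the exact formula for the Error Zone score, but the description (directional imbalance of probability shifts at error positions, computed from one forward pass of the fine-tuned model and one of the pretrained reference) suggests a statistic along the following lines. This is a minimal illustrative sketch under those assumptions, not the authors' implementation; the function name `ez_score` and the sign-based imbalance measure are hypothetical.

```python
import numpy as np

def ez_score(ft_logprobs, ref_logprobs, targets, ft_preds):
    """Sketch of an Error-Zone-style membership score.

    ft_logprobs, ref_logprobs : (T,) log-probabilities that the
        fine-tuned and reference models assign to the true next
        token at each position (one forward pass per model).
    targets  : (T,) ground-truth token ids.
    ft_preds : (T,) argmax predictions of the fine-tuned model.
    """
    # Error zone: positions where the fine-tuned model predicts wrongly.
    errors = ft_preds != targets
    if not errors.any():
        return 0.0
    # Probability shift relative to the pretrained reference model.
    shift = ft_logprobs[errors] - ref_logprobs[errors]
    # Directional imbalance: for training members, shifts at error
    # positions should skew positive (elevated probability despite
    # an incorrect prediction).
    return float(np.mean(np.sign(shift)))
```

A higher score indicates that, even where the model errs, it assigns the true tokens more probability than the reference model does, which is the memorization signal the paper exploits.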