With the widespread adoption of Large Language Models (LLMs) and increasingly stringent privacy regulations, protecting data privacy in LLMs has become essential, especially for privacy-sensitive applications. Membership Inference Attacks (MIAs) attempt to determine whether a specific data sample was included in a model's training or fine-tuning dataset, posing serious privacy risks. However, most existing MIA techniques against LLMs rely on sequence-level aggregated prediction statistics, which fail to distinguish prediction improvements caused by generalization from those caused by memorization, leading to low attack effectiveness. To address this limitation, we propose HT-MIA, a novel membership inference approach that captures token-level probabilities at low-confidence (hard) tokens, where membership signals are more pronounced. By comparing token-level probability improvements at hard tokens between a fine-tuned target model and a pre-trained reference model, HT-MIA isolates strong, robust membership signals that prior MIA approaches obscure. Extensive experiments on both domain-specific medical datasets and general-purpose benchmarks demonstrate that HT-MIA consistently outperforms seven state-of-the-art MIA baselines. We further investigate differentially private training as an effective defense mechanism against MIAs on LLMs. Overall, the HT-MIA framework establishes hard-token analysis as a strong foundation for advancing membership inference attacks and defenses for LLMs.
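To make the core idea concrete, the following is a minimal Python sketch of a hard-token membership score, not the paper's implementation. The probability threshold `tau` for flagging hard tokens, the use of mean probability improvement as the final score, and the helper names `token_log_probs` and `ht_mia_score` are all illustrative assumptions layered on the abstract's description.

```python
# Minimal sketch (illustrative, not the authors' code) of a hard-token
# membership score: tokens the reference model predicts with probability
# below tau are treated as "hard", and the score is the mean probability
# improvement of the fine-tuned target model over the reference at them.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def token_log_probs(model, tokenizer, text, device="cpu"):
    """Per-token log-probabilities the model assigns to `text`."""
    ids = tokenizer(text, return_tensors="pt").input_ids.to(device)
    with torch.no_grad():
        logits = model(ids).logits                      # (1, T, V)
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    # Log-probability of each actual next token.
    return log_probs.gather(-1, ids[:, 1:].unsqueeze(-1)).squeeze()  # (T-1,)

def ht_mia_score(target, reference, tokenizer, text, tau=0.1):
    """Higher score suggests the sample was a fine-tuning member."""
    lp_ref = token_log_probs(reference, tokenizer, text)
    lp_tgt = token_log_probs(target, tokenizer, text)
    hard = lp_ref.exp() < tau     # low-confidence tokens under the reference
    if not hard.any():
        return 0.0
    # Mean probability improvement restricted to hard tokens.
    return (lp_tgt.exp() - lp_ref.exp())[hard].mean().item()
```

Restricting the comparison to hard tokens is what filters generalization from memorization here: tokens the pre-trained reference already predicts confidently improve for members and non-members alike, whereas large gains at hard tokens are more plausibly explained by the target model having memorized the sequence.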