Most membership inference attacks (MIAs) against Large Language Models (LLMs) rely on global signals, like average loss, to identify training data. This approach, however, dilutes the subtle, localized signals of memorization, reducing attack effectiveness. We challenge this global-averaging paradigm, positing that membership signals are more pronounced within localized contexts. We introduce WBC (Window-Based Comparison), which exploits this insight through a sliding window approach with sign-based aggregation. Our method slides windows of varying sizes across text sequences, with each window casting a binary vote on membership based on loss comparisons between target and reference models. By ensembling votes across geometrically spaced window sizes, we capture memorization patterns from token-level artifacts to phrase-level structures. Extensive experiments across eleven datasets demonstrate that WBC substantially outperforms established baselines, achieving higher AUC scores and 2-3 times improvements in detection rates at low false positive thresholds. Our findings reveal that aggregating localized evidence is fundamentally more effective than global averaging, exposing critical privacy vulnerabilities in fine-tuned LLMs.
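The mechanism described above (sliding windows, a binary vote per window from a target-vs-reference loss comparison, votes ensembled across geometrically spaced window sizes) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name `wbc_score`, the specific window sizes, and the convention that a higher score indicates membership are all assumptions for the example.

```python
import numpy as np

def wbc_score(target_losses, ref_losses, window_sizes=(2, 4, 8, 16)):
    """Illustrative Window-Based Comparison (WBC) membership score.

    target_losses / ref_losses: per-token losses for one text sequence
    under the target model and a reference model, respectively.
    window_sizes: geometrically spaced window widths (values assumed here).
    Returns a score in [-1, 1]; higher suggests the sequence was a
    training member under this sketch's convention.
    """
    t = np.asarray(target_losses, dtype=float)
    r = np.asarray(ref_losses, dtype=float)
    votes = []
    for w in window_sizes:
        if w > len(t):
            continue  # skip windows longer than the sequence
        for i in range(len(t) - w + 1):
            # Sign-based vote: +1 ("member") if the target model's average
            # loss over this window is lower than the reference model's.
            votes.append(1.0 if t[i:i + w].mean() < r[i:i + w].mean() else -1.0)
    # Ensemble: average the binary votes over all windows and sizes.
    return float(np.mean(votes)) if votes else 0.0
```

Because each window votes independently, a short memorized span can dominate its local windows even when the sequence-level average loss looks unremarkable, which is the localized-signal intuition the abstract argues for.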