Membership inference attacks (MIAs) are a canonical way to assess a machine learning model's privacy properties. Although several attempts have been made to evaluate MIAs on language models, the existing literature has struggled to construct clean evaluations for testing new techniques. In particular, subtle distribution shifts between member and non-member sets can undermine the statistical validity of MIAs; recent work has underscored this by showing that "blind" methods with no access to the underlying model can perform far better than published methods on the same benchmarks. This paper constructs a benchmark for principled evaluation of MIAs against LLMs by leveraging the insight that training data seen before and after a fixed point during training are drawn from the same distribution. Therefore, any open-source model with intermediate checkpoints and public training data can be converted into an MIA testbed. We apply our framework to a half-dozen published attacks on the Pythia and OLMo families of models, from 70M to 7B parameters. To facilitate further privacy research, we open-source a modular library for designing and implementing attacks in this setting: https://github.com/safr-ai-lab/pandora_llm.
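The checkpoint-based split described above can be illustrated with a minimal sketch. It assumes the training order is known and that every step consumes a fixed number of examples; the function names `split_by_checkpoint` and `auc` are hypothetical and are not taken from the pandora_llm library. Scores follow the convention that higher means "more likely a member" (for example, a negated loss).

```python
import random


def split_by_checkpoint(training_order, checkpoint_step, batch_size):
    """Split example IDs into members and non-members of a checkpoint.

    Assumption: training_order lists example IDs in the order they were
    consumed, and every step consumes exactly batch_size examples.
    """
    cutoff = checkpoint_step * batch_size
    members = training_order[:cutoff]      # seen at or before the checkpoint
    non_members = training_order[cutoff:]  # seen only after the checkpoint
    return members, non_members


def auc(member_scores, non_member_scores):
    """Probability that a random member outscores a random non-member.

    Ties count as one half. This equals the area under the ROC curve.
    """
    wins = 0.0
    for m in member_scores:
        for n in non_member_scores:
            if m > n:
                wins += 1.0
            elif m == n:
                wins += 0.5
    return wins / (len(member_scores) * len(non_member_scores))


# Toy demonstration: 1000 examples, checkpoint taken after 5 steps of 100.
order = list(range(1000))
members, non_members = split_by_checkpoint(order, checkpoint_step=5, batch_size=100)

# A "blind" attack whose scores ignore membership entirely. Because both
# sets come from the same distribution, its AUC should sit near chance.
rng = random.Random(0)
blind_member_scores = [rng.random() for _ in members]
blind_non_member_scores = [rng.random() for _ in non_members]
blind_auc = auc(blind_member_scores, blind_non_member_scores)
```

Under this construction, an attack that never queries the model has no distributional signal to exploit, so any AUC well above 0.5 must come from the model itself.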