On Membership Inference Attacks in Knowledge Distillation

Large language models (LLMs) are trained on massive corpora that may contain sensitive information, creating privacy risks under membership inference attacks (MIAs). Knowledge distillation is widely used to compress LLMs into smaller student models, but its privacy implications are poorly understood. We systematically evaluate how distillation affects MIA vulnerability across six teacher-student model pairs and six attack methods. We find that distilled student models do not consistently exhibit lower MIA success than their teacher models, and in some cases show substantially higher member-specific attack success, challenging the assumption that knowledge distillation inherently improves privacy. We attribute this to mixed supervision in distillation: for vulnerable training data points, teacher predictions often align with ground-truth labels, so student models learn overly confident predictions that amplify the separability between members and non-members; for non-vulnerable points, teacher predictions and ground truth frequently diverge, providing inconsistent learning signals. To mitigate this, we propose three practical interventions: restricting distillation to non-vulnerable points, adding a low-dimensional Bottleneck Projection, and applying a normalization variant (NoNorm). Experiments show that these methods reduce both aggregate and member-specific MIA success while preserving model utility, improving privacy-utility trade-offs for distilled LLMs.

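The mixed-supervision argument can be illustrated with a minimal sketch. It assumes the common distillation objective alpha * CE(label, student) + (1 - alpha) * KL(teacher || student) at temperature 1, which the abstract does not specify. The function names, the value of `alpha`, and the toy logits are hypothetical. Under this assumed objective, the student's optimal output is a blend of the one-hot label and the teacher distribution, so an aligned teacher yields a near-certain target while a divergent teacher yields a split one.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a 1-D array of logits."""
    z = np.asarray(logits, dtype=float)
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def blended_target(teacher_logits, label, alpha=0.5):
    """Distribution minimizing the assumed mixed loss (temperature 1):
    alpha * CE(label, s) + (1 - alpha) * KL(teacher || s)."""
    p_teacher = softmax(teacher_logits)
    one_hot = np.zeros_like(p_teacher)
    one_hot[label] = 1.0
    return alpha * one_hot + (1.0 - alpha) * p_teacher

def mixed_loss(student_logits, teacher_logits, label, alpha=0.5):
    """Assumed mixed-supervision loss for a single example (temperature 1)."""
    p_s = softmax(student_logits)
    p_t = softmax(teacher_logits)
    ce = -np.log(p_s[label])                             # ground-truth term
    kl = float(np.sum(p_t * (np.log(p_t) - np.log(p_s))))  # teacher term
    return alpha * ce + (1.0 - alpha) * kl

label = 0
aligned_teacher = [4.0, 0.0, 0.0]    # hypothetical: teacher's top class matches the label
divergent_teacher = [0.0, 4.0, 0.0]  # hypothetical: teacher's top class disagrees

aligned = blended_target(aligned_teacher, label)
divergent = blended_target(divergent_teacher, label)

# Both signals agree, so the target puts almost all mass on the true label.
print("aligned target:  ", np.round(aligned, 3))
# The signals conflict, so the target mass is split across two classes.
print("divergent target:", np.round(divergent, 3))
```

In this toy setting the aligned case pushes the student toward a sharper prediction on the training point than either signal would alone, which is the kind of confidence gap that loss- or confidence-based MIAs exploit. The sketch says nothing about how vulnerable points are identified in the paper.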