Fine-tuning Large Language Models (LLMs) on sensitive datasets carries a substantial risk of unintended memorization and leakage of Personally Identifiable Information (PII), which can violate privacy regulations and compromise individual safety. In this work, we systematically investigate a critical and underexplored vulnerability: the exposure of PII that appears only in model inputs, not in training targets. Using both synthetic and real-world datasets, we design controlled extraction probes to quantify unintended PII memorization and study how factors such as language, PII frequency, task type, and model size influence memorization behavior. We further benchmark four privacy-preserving approaches including differential privacy, machine unlearning, regularization, and preference alignment, evaluating their trade-offs between privacy and task performance. Our results show that post-training methods generally provide more consistent privacy-utility trade-offs, while differential privacy achieves strong reduction in leakage in specific settings, although it can introduce training instability. These findings highlight the persistent challenge of memorization in fine-tuned LLMs and emphasize the need for robust, scalable privacy-preserving techniques.
翻译:在敏感数据集上微调大型语言模型(LLM)存在显著的非预期记忆与个人可识别信息(PII)泄露风险,这可能违反隐私法规并危及个人安全。本研究系统性地探究了一个关键且尚未被充分探索的脆弱性:仅出现在模型输入中、未出现在训练目标中的PII暴露问题。通过使用合成数据集和真实数据集,我们设计了受控提取探针以量化非预期的PII记忆,并研究了语言、PII出现频率、任务类型和模型规模等因素如何影响记忆行为。我们进一步对四种隐私保护方法进行了基准测试,包括差分隐私、机器遗忘、正则化和偏好对齐,评估了它们在隐私保护与任务性能之间的权衡。结果表明,后训练方法通常能提供更一致的隐私-效用权衡,而差分隐私在特定场景下能显著降低信息泄露,尽管可能引入训练不稳定性。这些发现凸显了微调LLM中记忆问题的持续挑战,并强调了开发鲁棒、可扩展的隐私保护技术的必要性。