We identify a novel phenomenon in language models: benign fine-tuning of frontier models can lead to privacy collapse. We find that diverse, subtle patterns in training data can degrade contextual privacy, including optimisation for helpfulness, exposure to user information, emotional and subjective dialogue, and debugging code that prints internal variables. Fine-tuned models lose the ability to reason about contextual privacy norms; they share information inappropriately with tools and violate memory boundaries across contexts. Privacy collapse is a ``silent failure'' because models maintain high performance on standard safety and utility benchmarks whilst exhibiting severe privacy vulnerabilities. Our experiments show evidence of privacy collapse across six models (closed- and open-weight), five fine-tuning datasets (real-world and controlled), and two task categories (agentic and memory-based). Our mechanistic analysis reveals that privacy representations are uniquely fragile to fine-tuning, whereas task-relevant features are preserved. These findings expose a critical gap in current safety evaluations, particularly for the deployment of specialised agents.