Modern language models exhibit rich internal structure, yet little is known about how privacy-sensitive behaviors, such as personally identifiable information (PII) leakage, are represented and modulated within their hidden states. We present UniLeak, a mechanistic-interpretability framework that identifies universal activation directions: latent directions in a model's residual stream whose linear addition at inference time consistently increases the likelihood of generating PII across prompts. These model-specific directions generalize across contexts and amplify the probability of PII generation with minimal impact on generation quality. UniLeak recovers such directions without access to training data or ground-truth PII, relying only on self-generated text. Across multiple models and datasets, steering along these universal directions substantially increases PII leakage compared to existing prompt-based extraction methods. Our results offer a new perspective on PII leakage as the superposition of a latent signal in the model's representations, enabling both risk amplification and mitigation.
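To make the intervention concrete, the following is a minimal sketch (not the authors' released code) of residual-stream activation steering: a fixed direction vector is added to one transformer layer's hidden states at inference time via a forward hook. The model choice (gpt2), intervention layer, and steering strength `alpha` are illustrative assumptions; UniLeak derives the direction from the model's self-generated text, which is replaced here by a random placeholder vector.

```python
# Sketch of inference-time residual-stream steering, assuming a
# HuggingFace GPT-2 model. The direction below is a placeholder;
# UniLeak would supply a learned unit vector instead.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

layer = 6                       # assumed intervention layer
alpha = 4.0                     # assumed steering strength
d_model = model.config.n_embd
direction = torch.randn(d_model)
direction = direction / direction.norm()  # unit-norm direction

def steer(module, inputs, output):
    # output[0] holds the block's hidden states: (batch, seq, d_model).
    # Adding alpha * direction shifts the residual stream at this layer.
    hidden = output[0] + alpha * direction.to(output[0].dtype)
    return (hidden,) + output[1:]

handle = model.transformer.h[layer].register_forward_hook(steer)
ids = tok("Contact details:", return_tensors="pt")
out = model.generate(**ids, max_new_tokens=30, do_sample=False,
                     pad_token_id=tok.eos_token_id)
print(tok.decode(out[0], skip_special_tokens=True))
handle.remove()                 # detach the hook to restore unsteered behavior
```

The same mechanism supports mitigation: subtracting the direction (negative `alpha`) or projecting it out of the hidden states would suppress, rather than amplify, the latent PII signal.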