Large language models (LLMs) can memorize and reproduce training sequences verbatim -- a tendency that undermines both generalization and privacy. Existing mitigation methods apply interventions uniformly, degrading performance on the majority of tokens that generalize normally. We show empirically that memorization is sparse, intermittent, and token-conditioned, suggesting that effective mitigation requires context-aware intervention rather than static parameter modification. To this end, we propose Gated Subspace Steering (GSS), a selective memorization mitigation method that decomposes intervention into a probe (detecting memorization-relevant activations) and a steer (applying a targeted correction only when the probe score exceeds a threshold). The optimal probe-steer pair emerges from a principled optimization framework based on optimal subspace steering. Experiments on four benchmarks show that GSS matches or exceeds state-of-the-art memorization reduction while requiring $100\text{--}1000\times$ less compute than optimization-based alternatives. Furthermore, we provide new theoretical insights into the geometry of memorization in neural representations.
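To make the probe-steer decomposition concrete, the following is a minimal sketch of how such a gated intervention might act on a hidden state. The rank-1 form and all names (`probe_dir`, `steer_dir`, `tau`) are illustrative assumptions; the paper's optimal probe-steer pair is obtained from its subspace-steering optimization, not hand-set as here.

```python
import torch

def gated_subspace_steer(h: torch.Tensor,
                         probe_dir: torch.Tensor,
                         steer_dir: torch.Tensor,
                         tau: float) -> torch.Tensor:
    """Hedged sketch of a gated steering intervention (names are assumptions).

    h:         hidden states, shape (..., d)
    probe_dir: unit vector scoring memorization-relevant activations, shape (d,)
    steer_dir: unit direction of the corrective edit, shape (d,)
    tau:       gating threshold on the probe score
    """
    score = h @ probe_dir                        # probe: per-token scalar score
    excess = (score - tau).clamp(min=0)          # zero unless the probe fires
    return h - excess.unsqueeze(-1) * steer_dir  # steer only the gated tokens
```

Because the correction vanishes for tokens whose probe score stays below the threshold, the majority of tokens that generalize normally pass through unmodified, which is what the token-conditioned view of memorization calls for.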