Probe-Geometry Alignment: Erasing the Cross-Sequence Memorization Signature Below Chance

Recent attacks show that behavioural unlearning of large language models leaves internal traces recoverable by adversarial probes. We characterise where this retention lives and show that it can be surgically removed without measurable capability cost. Our central protocol is a leave-one-out cross-sequence probe that tests whether a memorisation signature generalises to held-out sequences. The signature is real and consistent across scales, with memorisation-specific gaps of +0.32, +0.19, and +0.30 on Pythia-70M, GPT-2 medium, and Mistral-7B respectively; on Pythia-70M, the random-initialisation control collapses to -0.04 at the deepest layer, where the pretrained signature peaks. The probe direction is causally separable from recall: projecting it out collapses the signature locally (+0.44 -> -0.19) while behavioural recall barely changes. Moreover, a probe trained on naturally memorised content does not classify fine-tuning-injected secrets, marking two representationally distinct regimes. We then introduce probe-geometry alignment (PGA), a surgical erasure that aligns activations along the probe's live readout direction at each depth. PGA drives the cross-sequence probe below random chance at all four scales tested (toy depth-4: 0.17; Pythia-70M: 0.07; Mistral-7B: 0.45; GPT-2 medium: 0.06 via MD-PGA k=2) and remains robust to six adversarial probe variants. Against a re-fitting attacker who trains a fresh probe on PGA-treated activations, we extend PGA adversarially, defeating the re-fit probe at every memorisation-relevant depth while preserving five zero-shot capability benchmarks within 2.8 percentage points per task (mean Δacc = +0.2pp). The cross-sequence signature is thus a real, causally separable, regime-specific property of pretrained representations, removable below chance with a single rank-one intervention per depth at no measurable capability cost.
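To make the two ingredients named above concrete, the following is a minimal sketch of a leave-one-sequence-out linear probe and a rank-one projection that removes a probe direction from activations. It is not the paper's implementation, and it is not PGA itself: the difference-of-means probe, the synthetic Gaussian "activations", and the names `fit_probe`, `loso_accuracy`, `project_out`, and `demo` are all assumptions introduced for illustration.

```python
import numpy as np

def fit_probe(X, y):
    """Difference-of-means linear probe (an assumed probe type, for illustration).
    Returns a unit direction w and a scalar threshold b."""
    mu1, mu0 = X[y == 1].mean(axis=0), X[y == 0].mean(axis=0)
    w = mu1 - mu0
    w = w / (np.linalg.norm(w) + 1e-12)
    b = 0.5 * (mu1 + mu0) @ w
    return w, b

def loso_accuracy(X, y, seq_ids):
    """Leave-one-sequence-out accuracy: fit on all other sequences,
    then classify every activation of the held-out sequence."""
    correct = 0
    for s in np.unique(seq_ids):
        held = seq_ids == s
        w, b = fit_probe(X[~held], y[~held])
        pred = (X[held] @ w > b).astype(int)
        correct += int((pred == y[held]).sum())
    return correct / len(y)

def project_out(X, w):
    """Rank-one erasure: remove the component of each row of X along w."""
    w = w / np.linalg.norm(w)
    return X - np.outer(X @ w, w)

def demo(seed=0, n_seq=40, tokens=5, dim=16, shift=4.0):
    """Synthetic stand-in for activations: 'memorised' sequences (label 1)
    are shifted along one hidden direction u; everything else is noise."""
    rng = np.random.default_rng(seed)
    seq_ids = np.repeat(np.arange(n_seq), tokens)
    y = np.repeat(np.arange(n_seq) % 2, tokens)
    u = rng.normal(size=dim)
    u /= np.linalg.norm(u)
    X = rng.normal(size=(n_seq * tokens, dim)) + shift * y[:, None] * u
    before = loso_accuracy(X, y, seq_ids)
    w_all, _ = fit_probe(X, y)          # probe direction fitted on all data
    after = loso_accuracy(project_out(X, w_all), y, seq_ids)
    return before, after

if __name__ == "__main__":
    before, after = demo()
    print(f"cross-sequence probe accuracy before: {before:.2f}, after: {after:.2f}")
```

In this toy setting the held-out accuracy is high before projection and falls below 0.5 afterwards. Here the below-chance value comes from the leave-one-out protocol: once the class-mean difference is removed, a probe fitted on the remaining sequences points away from the held-out one. Whether the same mechanism accounts for the reported numbers is not something this sketch can establish.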
