Probe-Geometry Alignment: Erasing the Cross-Sequence Memorization Signature Below Chance

Recent attacks show that behavioural unlearning of large language models leaves internal traces recoverable by adversarial probes. We characterise where this retention lives and show it can be surgically removed without measurable capability cost. Our central protocol is a leave-one-out cross-sequence probe that tests whether a memorisation signature generalises across held-out sequences. The signature is real and consistent across scales: memorisation-specific gaps of +0.32, +0.19, and +0.30 on Pythia-70M, GPT-2 medium, and Mistral-7B, respectively; on Pythia-70M, the random-initialisation control collapses to -0.04 at the deepest layer where the pretrained signature peaks. The probe direction is causally separable from recall: projecting it out collapses the signature locally (from +0.44 to -0.19) while behavioural recall barely changes. Moreover, a probe trained on naturally memorised content does not classify fine-tuning-injected secrets, marking two representationally distinct regimes. We then introduce probe-geometry alignment (PGA), a surgical erasure that aligns activations along the probe's live readout direction at each depth. PGA drives the cross-sequence probe below random chance at all four scales tested (toy depth-4: 0.17; Pythia-70M: 0.07; Mistral-7B: 0.45; GPT-2 medium: 0.06 via MD-PGA k=2) and remains robust to six adversarial probe variants. Against a re-fitting attacker who trains a fresh probe on PGA-treated activations, we extend PGA adversarially, defeating the re-fit probe at every memorisation-relevant depth while preserving five zero-shot capability benchmarks within 2.8 percentage points per task (mean Δacc = +0.2pp). The cross-sequence signature is a real, causally separable, regime-specific property of pretrained representations, and it is removable below chance with a single rank-one intervention per depth at no measurable capability cost.
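To make the two mechanical ingredients of the abstract concrete, the following is a minimal sketch of a leave-one-sequence-out probe evaluation and of projecting a probe direction out of activations with a rank-one update. It is an illustration under stated assumptions, not the paper's implementation: the difference-of-means probe, the synthetic activations, and all function and variable names (`project_out`, `mean_diff_probe`, `leave_one_sequence_out_accuracy`) are hypothetical stand-ins, and the sketch covers only the projection ablation, not PGA itself.

```python
import numpy as np


def project_out(acts, direction):
    """Rank-one projection: remove each row's component along `direction`."""
    w = direction / np.linalg.norm(direction)
    return acts - np.outer(acts @ w, w)


def mean_diff_probe(acts, labels):
    """Stand-in linear probe (an assumption, not the paper's probe):
    difference of class means with a midpoint threshold."""
    mu1 = acts[labels == 1].mean(axis=0)
    mu0 = acts[labels == 0].mean(axis=0)
    w = mu1 - mu0
    b = -0.5 * (mu1 + mu0) @ w
    return w, b


def leave_one_sequence_out_accuracy(acts, labels, seq_ids):
    """Fit the probe on all sequences but one, score the held-out sequence,
    and pool the per-token results over every held-out sequence."""
    correct = 0
    for s in np.unique(seq_ids):
        test = seq_ids == s
        w, b = mean_diff_probe(acts[~test], labels[~test])
        pred = (acts[test] @ w + b > 0).astype(int)
        correct += int((pred == labels[test]).sum())
    return correct / len(labels)


# Synthetic activations (purely illustrative): 20 sequences of 8 tokens each,
# 16 dimensions, with a class-dependent shift along one hidden direction.
rng = np.random.default_rng(0)
n_seq, n_tok, dim = 20, 8, 16
seq_ids = np.repeat(np.arange(n_seq), n_tok)
labels = (np.arange(n_seq) % 2)[seq_ids]          # one label per sequence
d = rng.normal(size=dim)
d /= np.linalg.norm(d)
X = rng.normal(size=(n_seq * n_tok, dim)) + 6.0 * labels[:, None] * d

acc_before = leave_one_sequence_out_accuracy(X, labels, seq_ids)

# Fit a probe on all data, then project its direction out of every activation.
w_full, _ = mean_diff_probe(X, labels)
X_clean = project_out(X, w_full)
acc_after = leave_one_sequence_out_accuracy(X_clean, labels, seq_ids)

print(f"held-out accuracy before projection: {acc_before:.2f}")
print(f"held-out accuracy after projection:  {acc_after:.2f}")
```

In this toy setting the projection removes only the single fitted direction, so a freshly fitted probe may still recover residual signal; that is the re-fitting attack the abstract addresses with the adversarial extension of PGA.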
