Probe-Geometry Alignment: Erasing the Cross-Sequence Memorization Signature Below Chance

Recent attacks show that behavioural unlearning of large language models leaves internal traces that adversarial probes can recover. We characterise where this retention lives and show that it can be removed surgically at no measurable capability cost. Our central protocol is a leave-one-out cross-sequence probe, which tests whether a memorisation signature generalises to held-out sequences. The signature is real and consistent across scale: the memorisation-specific gaps are +0.32, +0.19, and +0.30 on Pythia-70M, GPT-2 medium, and Mistral-7B respectively, and on Pythia-70M the random-initialisation control collapses to -0.04 at the deepest layer, where the pretrained signature peaks. The probe direction is causally separable from recall: projecting it out collapses the signature locally (from +0.44 to -0.19) while behavioural recall barely changes. Moreover, a probe trained on naturally memorised content does not classify fine-tuning-injected secrets, which marks two representationally distinct regimes. We then introduce probe-geometry alignment (PGA), a surgical erasure that aligns activations along the probe's live readout direction at each depth. PGA drives the cross-sequence probe below random chance at all four scales tested (toy depth-4: 0.17; Pythia-70M: 0.07; Mistral-7B: 0.45; GPT-2 medium: 0.06 via MD-PGA k=2) and remains robust to six adversarial probe variants. Against a re-fitting attacker who trains a fresh probe on PGA-treated activations, we extend PGA adversarially, defeating the re-fit probe at every memorisation-relevant depth while keeping five zero-shot capability benchmarks within 2.8 percentage points per task (mean Δacc = +0.2pp). The cross-sequence signature is thus a real, causally separable, regime-specific property of pretrained representations, and it can be removed below chance with a single rank-one intervention per depth at no measurable capability cost.
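
To make the two mechanical ingredients of the abstract concrete, the following is a minimal sketch of a leave-one-sequence-out probe evaluation and a rank-one "project the probe direction out" step. It is not the paper's PGA procedure or its probe: the difference-of-means probe, the synthetic activations, the planted direction `v`, and every function name here are assumptions chosen purely for illustration.

```python
import numpy as np

def project_out(acts, direction):
    """Rank-one erasure: remove each activation's component along `direction`."""
    u = direction / np.linalg.norm(direction)
    return acts - np.outer(acts @ u, u)

def mean_diff_probe(X, y):
    """Hypothetical stand-in probe: difference of class means, midpoint threshold."""
    mu1, mu0 = X[y == 1].mean(axis=0), X[y == 0].mean(axis=0)
    w = mu1 - mu0
    b = -0.5 * (mu1 + mu0) @ w
    return w, b

def leave_one_sequence_out_accuracy(X, y, seq_ids):
    """Fit the probe on all sequences but one, score it on the held-out sequence."""
    correct, total = 0, 0
    for s in np.unique(seq_ids):
        held = seq_ids == s
        w, b = mean_diff_probe(X[~held], y[~held])
        pred = (X[held] @ w + b > 0).astype(int)
        correct += int((pred == y[held]).sum())
        total += int(held.sum())
    return correct / total

# Synthetic stand-in for per-token activations (assumption, not real model data):
# 20 sequences, 10 tokens each, 16-dim activations, one label per sequence.
rng = np.random.default_rng(0)
n_seq, tokens_per_seq, d = 20, 10, 16
seq_ids = np.repeat(np.arange(n_seq), tokens_per_seq)
seq_labels = np.arange(n_seq) % 2            # 1 = "memorised", 0 = control
y = seq_labels[seq_ids]
v = rng.normal(size=d)
v /= np.linalg.norm(v)                       # planted signature direction
X = rng.normal(size=(n_seq * tokens_per_seq, d)) + 2.0 * np.outer(2 * y - 1, v)

# Fit one probe on all data, then erase its readout direction.
w_full, _ = mean_diff_probe(X, y)
u = w_full / np.linalg.norm(w_full)
X_erased = project_out(X, w_full)

acc_before = leave_one_sequence_out_accuracy(X, y, seq_ids)
acc_after = leave_one_sequence_out_accuracy(X_erased, y, seq_ids)
print(f"held-out probe accuracy before erasure: {acc_before:.2f}")
print(f"held-out probe accuracy after erasure:  {acc_after:.2f}")
```

Holding out whole sequences rather than individual tokens is what makes the evaluation cross-sequence: the probe never sees any token from the sequence it is scored on. The second accuracy comes from a probe re-fitted on the erased activations, which loosely mirrors the re-fitting attacker described above, though only for this toy setup.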
