Reinforcement Learning with Verifiable Rewards (RLVR) is highly effective for enhancing LLM reasoning, yet recent evidence shows that models like Qwen 2.5 achieve significant gains even with spurious or incorrect rewards. We investigate this phenomenon and identify a "Perplexity Paradox": spurious RLVR triggers a divergence in which answer-token perplexity drops while prompt-side coherence degrades, suggesting the model bypasses reasoning in favor of memorization. Using Path Patching, Logit Lens, JSD analysis, and Neural Differential Equations, we uncover a hidden Anchor-Adapter circuit that facilitates this shortcut. We localize a Functional Anchor in the middle layers (L18-20) that triggers retrieval of memorized solutions, followed by Structural Adapters in later layers (L21+) that transform representations to accommodate the shortcut signal. Finally, we demonstrate that scaling specific MLP keys within this circuit enables bidirectional causal steering: artificially amplifying or suppressing contamination-driven performance. Our results provide a mechanistic roadmap for identifying and mitigating data contamination in RLVR-tuned models. Code is available at https://github.com/idwts/How-RLVR-Activates-Memorization-Shortcuts.
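The key-scaling intervention can be illustrated on a toy feed-forward block under the key-value memory view of transformer MLPs, where rows of the input projection act as "keys" and columns of the output projection as "values". This is a minimal sketch with illustrative dimensions and a hypothetical `key_scale` hook, not the paper's actual implementation: multiplying a key's activation by a factor greater than one amplifies its associated value's contribution to the residual stream (steering toward the memorized solution), while a factor of zero ablates it (suppressing the shortcut).

```python
import numpy as np

# Toy FFN mirroring a transformer MLP (key-value memory framing).
# Dimensions are illustrative, far smaller than a real model.
rng = np.random.default_rng(0)
d_model, d_ff = 8, 32
W_in = rng.standard_normal((d_ff, d_model))   # rows = MLP "keys"
W_out = rng.standard_normal((d_model, d_ff))  # columns = MLP "values"

def mlp(x, key_scale=None):
    """Forward pass with an optional steering hook.

    key_scale: hypothetical dict mapping key index -> multiplier
    (>1 amplifies that key's value vector, 0 ablates it).
    """
    acts = np.maximum(W_in @ x, 0.0)          # ReLU key activations
    if key_scale:
        for idx, s in key_scale.items():
            acts[idx] *= s
    return W_out @ acts                       # weighted sum of value vectors

x = rng.standard_normal(d_model)
base = mlp(x)
amplified = mlp(x, key_scale={3: 4.0})        # push key 3's value harder
suppressed = mlp(x, key_scale={k: 0.0 for k in range(d_ff)})  # full ablation
```

Because the output is a linear combination of value vectors weighted by key activations, scaling a single key changes the output only along that key's value direction, which is what makes the intervention a localized, bidirectional causal probe.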