Fine-tuned large language models can exhibit reward-hacking behavior arising from emergent misalignment, which is difficult to detect from final outputs alone. While prior work has studied reward hacking at the level of completed responses, it remains unclear whether such behavior can be identified during generation. We propose an activation-based monitoring approach that detects reward-hacking signals from internal representations as a model generates its response. Our method trains sparse autoencoders on residual stream activations and applies lightweight linear classifiers to produce token-level estimates of reward-hacking activity. Across multiple model families and fine-tuning mixtures, we find that internal activation patterns reliably distinguish reward hacking from benign behavior, generalize to unseen mixed-policy adapters, and exhibit model-dependent temporal structure during chain-of-thought reasoning. Notably, reward-hacking signals often emerge early, persist throughout reasoning, and can be amplified by increased test-time compute in the form of chain-of-thought prompting under weakly specified reward objectives. These results suggest that internal activation monitoring provides a signal of emergent misalignment that is earlier than, and complementary to, output-based evaluation, supporting more robust post-deployment safety monitoring of fine-tuned language models.
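To make the monitoring pipeline concrete, the sketch below illustrates the general shape of the approach described in the abstract: a sparse autoencoder trained on residual-stream activations, with a lightweight linear classifier applied to its features to yield token-level reward-hacking scores. This is a minimal illustration, not the authors' implementation; the layer choice, dimensions, class names, and thresholds are assumptions, and the placeholder activations stand in for activations captured from a hooked model layer.

```python
# Minimal sketch (illustrative only): token-level reward-hacking probe built
# from a sparse autoencoder (SAE) over residual-stream activations plus a
# linear classifier. Widths and hyperparameters below are assumed values.
import torch
import torch.nn as nn

D_MODEL, D_SAE = 4096, 16384  # assumed residual-stream width and SAE width


class SparseAutoencoder(nn.Module):
    """ReLU SAE trained (elsewhere) to reconstruct residual-stream activations."""

    def __init__(self, d_model: int, d_sae: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_sae)
        self.decoder = nn.Linear(d_sae, d_model)

    def encode(self, x: torch.Tensor) -> torch.Tensor:
        return torch.relu(self.encoder(x))  # sparse feature activations

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.decoder(self.encode(x))  # reconstruction


sae = SparseAutoencoder(D_MODEL, D_SAE)
probe = nn.Linear(D_SAE, 1)  # lightweight linear classifier on SAE features


def token_level_scores(resid_acts: torch.Tensor) -> torch.Tensor:
    """resid_acts: [seq_len, d_model] residual-stream activations for one
    generated response; returns per-token reward-hacking probabilities."""
    with torch.no_grad():
        features = sae.encode(resid_acts)  # [seq_len, d_sae]
        return torch.sigmoid(probe(features)).squeeze(-1)  # [seq_len]


# Placeholder activations standing in for a hooked residual-stream layer.
scores = token_level_scores(torch.randn(128, D_MODEL))
flagged = (scores > 0.5).nonzero().squeeze(-1)  # tokens above an assumed threshold
```

In this framing, the SAE and probe are trained offline on labeled reward-hacking and benign rollouts, and at inference time the probe is evaluated on each generated token's activations, which is what allows the signal to be read out during generation rather than only from the completed response.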