Reinforcement learning (RL) has become essential to the post-training of large language models (LLMs) for reasoning, agentic capabilities, and alignment. Successful RL relies on sufficient exploration of diverse actions by the model during training, which creates a potential failure mode: a model could strategically alter its exploration during training to influence the subsequent training outcome. In this paper, we study this behavior, called exploration hacking. First, we create model organisms of selective RL resistance by fine-tuning LLMs to follow specific underperformance strategies; these models can successfully resist our RL-based capability elicitation in agentic biosecurity and AI R&D environments while maintaining performance on related tasks. We then use our model organisms to evaluate detection and mitigation strategies, including monitoring, weight noising, and SFT-based elicitation. Finally, we show that current frontier models can exhibit explicit reasoning about suppressing their exploration when provided with sufficient information about their training context, with higher rates when this information is acquired indirectly through the environment. Together, our results suggest that exploration hacking is a possible failure mode of RL on sufficiently capable LLMs.