Frontier language model capabilities are improving rapidly. We thus need stronger mitigations against bad actors misusing increasingly powerful systems. Prior work has shown that activation probes may be a promising misuse mitigation technique, but we identify a key remaining challenge: probes fail to generalize under important production distribution shifts. In particular, we find that the shift from short-context to long-context inputs is difficult for existing probe architectures. We propose several new probe architectures that handle this long-context distribution shift. We evaluate these probes in the cyber-offensive domain, testing their robustness against various production-relevant distribution shifts, including multi-turn conversations, long context prompts, and adaptive red teaming. Our results demonstrate that while our novel architectures address context length, a combination of architecture choice and training on diverse distributions is required for broad generalization. Additionally, we show that pairing probes with prompted classifiers achieves optimal accuracy at a low cost due to the computational efficiency of probes. These findings have informed the successful deployment of misuse mitigation probes in user-facing instances of Gemini, Google's frontier language model. Finally, we find early positive results using AlphaEvolve to automate improvements in both probe architecture search and adaptive red teaming, showing that automating some AI safety research is already possible.
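To make the core technique concrete: an activation probe is typically a small classifier (often linear) trained on a model's internal activations to flag a target behavior, here misuse. The sketch below is a minimal illustration under assumed conditions, not the paper's architecture: it uses synthetic stand-ins for mean-pooled residual-stream activations and fits a linear probe with logistic-regression gradient descent. In a real deployment the features would be activations extracted from the frontier model itself, and the architectures studied in this work go beyond a single linear layer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for mean-pooled activations (dimension d) from
# benign vs. cyber-offensive prompts; a real probe would use activations
# read out of the language model, not synthetic Gaussians.
d, n = 64, 400
benign = rng.normal(0.0, 1.0, size=(n, d))
harmful = rng.normal(0.5, 1.0, size=(n, d))  # shifted mean -> separable signal
X = np.vstack([benign, harmful])
y = np.concatenate([np.zeros(n), np.ones(n)])

# A linear probe is logistic regression on the activations,
# trained here with plain full-batch gradient descent.
w, b, lr = np.zeros(d), 0.0, 0.1
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted misuse probability
    w -= lr * (X.T @ (p - y)) / len(y)
    b -= lr * np.mean(p - y)

acc = np.mean((X @ w + b > 0) == y)
print(f"train accuracy: {acc:.2f}")
```

Because the probe is just a dot product per input, scoring is orders of magnitude cheaper than running a prompted classifier, which is what makes the probe-plus-classifier cascade described in the abstract economical.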