Frontier language model capabilities are improving rapidly, so we need stronger mitigations against bad actors misusing increasingly powerful systems. Prior work has shown that activation probes may be a promising misuse mitigation technique, but we identify a key remaining challenge: probes fail to generalize under important production distribution shifts. In particular, we find that the shift from short-context to long-context inputs is difficult for existing probe architectures. We propose several new probe architectures that handle this long-context distribution shift. We evaluate these probes in the cyber-offensive domain, testing their robustness against various production-relevant shifts, including multi-turn conversations, static jailbreaks, and adaptive red teaming. Our results demonstrate that while multimax addresses context length, broad generalization requires combining architecture choice with training on diverse distributions. Additionally, we show that pairing probes with prompted classifiers achieves optimal accuracy at low cost, owing to the computational efficiency of probes. These findings have informed the successful deployment of misuse mitigation probes in user-facing instances of Gemini, Google's frontier language model. Finally, we report early positive results using AlphaEvolve to automate improvements in both probe architecture search and adaptive red teaming, showing that automating some AI safety research is already possible.
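The abstract does not define the multimax aggregation, so the following is only a minimal illustrative sketch of the general idea it gestures at: score each token's activation with a linear probe, then pool with a top-k maximum so that a short harmful span is not diluted by averaging over a long benign context. All names, shapes, and the top-k choice here are assumptions for illustration, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def linear_probe_scores(activations, w, b=0.0):
    # Per-token logits from a linear probe applied to each
    # token's activation vector (shape: [seq_len, d_model]).
    return activations @ w + b

def multimax_aggregate(token_scores, k=4):
    # Hypothetical "multimax"-style pooling: average the top-k
    # per-token scores, so a few strongly-flagged tokens dominate
    # the sequence-level score even in a very long context.
    top_k = np.sort(token_scores)[-k:]
    return top_k.mean()

# Toy example: a 2048-token context scored by a random probe.
d_model = 16
acts = rng.normal(size=(2048, d_model))
w = rng.normal(size=d_model)
sequence_score = multimax_aggregate(linear_probe_scores(acts, w))
```

Compared with mean pooling, this kind of pooling keeps the sequence score sensitive to localized evidence, which is one plausible reason a max-style aggregation would help under the short-to-long-context shift the abstract describes.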