Monitoring is an important aspect of safely deploying Large Language Models (LLMs). This paper examines activation probes for detecting ``high-stakes'' interactions -- where the text indicates that the interaction might lead to significant harm -- as a critical, yet underexplored, target for such monitoring. We evaluate several probe architectures trained on synthetic data and find that they generalize robustly to diverse, out-of-distribution, real-world data. The probes' performance is comparable to that of prompted or finetuned medium-sized LLM monitors, while offering computational savings of six orders of magnitude. These savings are enabled by reusing activations of the model that is being monitored. Our experiments also highlight the potential of building resource-aware hierarchical monitoring systems, where probes serve as an efficient initial filter and flag cases for more expensive downstream analysis. We release our novel synthetic dataset and the codebase at https://github.com/arrrlex/models-under-pressure.
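The core mechanism the abstract describes -- a lightweight probe trained on activations reused from the monitored model -- can be sketched as follows. This is a minimal illustration, not the paper's actual setup: the activation dimensionality, the synthetic Gaussian data standing in for cached activations, and the plain logistic-regression probe with gradient descent are all assumptions made here for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64  # hypothetical activation dimension (illustrative, not from the paper)

# Synthetic stand-in for cached residual-stream activations: high-stakes
# examples are shifted along a hypothetical "stakes" direction.
direction = rng.normal(size=d)
direction /= np.linalg.norm(direction)
X = np.vstack([
    rng.normal(size=(200, d)) + 1.5 * direction,  # label 1: high-stakes
    rng.normal(size=(200, d)) - 1.5 * direction,  # label 0: low-stakes
])
y = np.concatenate([np.ones(200), np.zeros(200)])

# Train a logistic-regression probe with plain gradient descent. Because the
# activations are already computed by the monitored model, the probe's own
# cost is just this d-dimensional dot product per example.
w, b = np.zeros(d), 0.0
for _ in range(500):
    z = np.clip(X @ w + b, -30, 30)       # clip logits for numerical safety
    p = 1.0 / (1.0 + np.exp(-z))          # predicted probability of high-stakes
    w -= 0.5 * (X.T @ (p - y)) / len(y)   # gradient step on weights
    b -= 0.5 * (p - y).mean()             # gradient step on bias

scores = 1.0 / (1.0 + np.exp(-np.clip(X @ w + b, -30, 30)))
accuracy = ((scores > 0.5) == y).mean()
print(f"train accuracy: {accuracy:.2f}")
```

In a hierarchical monitoring setup of the kind the abstract mentions, `scores` above a threshold would flag the interaction for a more expensive downstream monitor, while the bulk of traffic is cleared by the cheap probe alone.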