Monitoring is an important aspect of safely deploying Large Language Models (LLMs). This paper examines activation probes for detecting ``high-stakes'' interactions -- those where the text indicates the interaction might lead to significant harm -- as a critical yet underexplored target for such monitoring. We evaluate several probe architectures trained on synthetic data and find that they generalize robustly to diverse, out-of-distribution real-world data. Probe performance is comparable to that of prompted or finetuned medium-sized LLM monitors, while offering computational savings of six orders of magnitude; these savings are possible because probes reuse the activations of the model being monitored. Our experiments also highlight the potential of resource-aware hierarchical monitoring systems, in which probes serve as an efficient initial filter that flags cases for more expensive downstream analysis. We release our novel synthetic dataset and codebase at https://github.com/arrrlex/models-under-pressure.
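To make the core idea concrete, the following is a minimal sketch of a linear activation probe; it is not the paper's actual architecture or data. It uses a simple difference-of-means classifier on synthetic stand-in "activation" vectors, where high-stakes examples are assumed to be shifted along a fixed direction; all names, dimensions, and data here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16  # stand-in for the monitored model's hidden size (illustrative)

# Synthetic stand-in for residual-stream activations: "high-stakes"
# examples are shifted along a fixed direction relative to "low-stakes" ones.
direction = rng.normal(size=d)
low = rng.normal(size=(200, d))
high = rng.normal(size=(200, d)) + direction

X = np.vstack([low, high])
y = np.array([0] * 200 + [1] * 200)

# Difference-of-means probe: w points from the low-stakes centroid
# to the high-stakes centroid; b places the boundary at their midpoint.
mu_low, mu_high = X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)
w = mu_high - mu_low
b = -0.5 * (mu_low + mu_high) @ w

# Scoring is one dot product plus a sigmoid per example, which is why
# a probe that reuses already-computed activations is so cheap compared
# to running a separate LLM monitor.
scores = 1.0 / (1.0 + np.exp(-(X @ w + b)))
flagged = scores > 0.5  # flag for more expensive downstream analysis
acc = (flagged == y).mean()
```

In a hierarchical monitoring setup, `flagged` would determine which few interactions get passed to a costlier monitor, while the bulk are cleared by the probe alone.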