Constitutional Black-Box Monitoring for Scheming in LLM Agents

Safe deployment of Large Language Model (LLM) agents in autonomous settings requires reliable oversight mechanisms. A central challenge is detecting scheming, where agents covertly pursue misaligned goals. One approach to mitigating such risks is LLM-based monitoring: using language models to examine agent behaviors for suspicious actions. We study constitutional black-box monitors: prompted classifiers that detect scheming using only externally observable inputs and outputs, optimized on synthetic data generated from natural-language behavior specifications. We introduce two pipelines for generating synthetic agent trajectories, STRIDE (iterative refinement) and Gloom (agent-environment simulation), from which we generate 1,000 samples each. We optimize frontier LLM monitors on these datasets via prompt sweeps, human refinement, and automated prompt optimization, and evaluate performance on 7,500 held-out trajectories from ControlArena, a suite of grounded environments where agents operate in more realistic contexts. Our results demonstrate that monitors selected purely on synthetic data can generalize to more realistic environments, capturing a meaningful scheming signal. However, we find that performance saturates quickly in our setting, with simple prompt sweeps matching the results of more extensive optimization. Pushing beyond this limit yields no further improvements and instead leads to overfitting.

翻译：在自主环境中安全部署大型语言模型（LLM）智能体需要可靠的监督机制。一个核心挑战是检测密谋行为，即智能体暗中追求与目标不一致的意图。缓解此类风险的一种方法是基于LLM的监控：利用语言模型检查智能体行为中的可疑操作。我们研究基于宪法的黑盒监控器：这是一种通过提示构建的分类器，仅利用外部可观察的输入和输出来检测密谋行为，并通过从自然语言行为规范生成的合成数据进行优化。我们引入了两种生成合成智能体轨迹的流程：STRIDE（迭代优化）和Gloom（智能体-环境模拟），并分别生成了1000个样本。我们通过提示搜索、人工优化和自动提示优化，在这些数据集上优化前沿LLM监控器，并在ControlArena（一套基于真实环境的测试平台，智能体在更现实的场景中运行）的7500个保留轨迹上评估性能。我们的结果表明，仅基于合成数据选择的监控器能够泛化到更现实的环境中，捕获有意义的密谋信号。然而，我们发现性能在我们的设置中迅速饱和，简单的提示搜索与更广泛的优化结果相当。突破这一限制不会带来进一步的改进，反而会导致过拟合。

相关内容

黑盒

关注 1

在科学，计算和工程学中，黑盒是一种设备，系统或对象，可以根据其输入和输出（或传输特性）对其进行查看，而无需对其内部工作有任何了解。它的实现是“不透明的”（黑色）。几乎任何事物都可以被称为黑盒：晶体管，引擎，算法，人脑，机构或政府。为了使用典型的“黑匣子方法”来分析建模为开放系统的事物，仅考虑刺激/响应的行为，以推断（未知）盒子。该黑匣子系统的通常表示形式是在该方框中居中的数据流程图。黑盒的对立面是一个内部组件或逻辑可用于检查的系统，通常将其称为白盒（有时也称为“透明盒”或“玻璃盒”）。

超越个体智能：基于LLM的多智能体系统中的协作、故障归因与自演化综述

专知会员服务

19+阅读 · 5月16日

从静态模板到动态运行时图：大语言模型智能体（LLM Agents）工作流优化综述

专知会员服务

23+阅读 · 3月30日

管理 LLM 智能体中的演进式记忆：风险、机理及稳定性与安全性受控记忆（SSGM）框架

专知会员服务

16+阅读 · 3月14日

法律领域中的大语言模型智能体：分类体系、应用场景与挑战

专知会员服务

17+阅读 · 1月14日