Safe deployment of Large Language Model (LLM) agents in autonomous settings requires reliable oversight mechanisms. A central challenge is detecting scheming, where agents covertly pursue misaligned goals. One approach to mitigating such risks is LLM-based monitoring: using language models to examine agent behaviors for suspicious actions. We study constitutional black-box monitors: prompted classifiers that detect scheming using only externally observable inputs and outputs, optimized on synthetic data generated from natural-language behavior specifications. We introduce two pipelines for generating synthetic agent trajectories, STRIDE (iterative refinement) and Gloom (agent-environment simulation), from which we generate 1,000 samples each. We optimize frontier LLM monitors on these datasets via prompt sweeps, human refinement, and automated prompt optimization, and evaluate performance on 7,500 held-out trajectories from ControlArena, a suite of grounded environments where agents operate in more realistic contexts. Our results demonstrate that monitors selected purely on synthetic data can generalize to more realistic environments, capturing a meaningful scheming signal. However, we find that performance saturates quickly in our setting, with simple prompt sweeps matching the results of more extensive optimization. Pushing beyond this limit yields no further improvements and instead leads to overfitting.
翻译:在自主环境中安全部署大型语言模型(LLM)智能体需要可靠的监督机制。一个核心挑战是检测密谋行为,即智能体暗中追求与目标不一致的意图。缓解此类风险的一种方法是基于LLM的监控:利用语言模型检查智能体行为中的可疑操作。我们研究基于宪法的黑盒监控器:这是一种通过提示构建的分类器,仅利用外部可观察的输入和输出来检测密谋行为,并通过从自然语言行为规范生成的合成数据进行优化。我们引入了两种生成合成智能体轨迹的流程:STRIDE(迭代优化)和Gloom(智能体-环境模拟),并分别生成了1000个样本。我们通过提示搜索、人工优化和自动提示优化,在这些数据集上优化前沿LLM监控器,并在ControlArena(一套基于真实环境的测试平台,智能体在更现实的场景中运行)的7500个保留轨迹上评估性能。我们的结果表明,仅基于合成数据选择的监控器能够泛化到更现实的环境中,捕获有意义的密谋信号。然而,我们发现性能在我们的设置中迅速饱和,简单的提示搜索与更广泛的优化结果相当。突破这一限制不会带来进一步的改进,反而会导致过拟合。