Autonomous LLM agents can pursue hidden malicious objectives through sequences of individually benign actions, making sabotage difficult to detect using standard trajectory-level monitoring. Existing approaches either evaluate complete trajectories in a single pass or partition them into independently scored windows, limiting their ability to connect evidence across temporally distant actions. We propose TRACE, a monitoring framework for long-horizon LLM agent trajectories. TRACE operates through a TIJ (Triage-Inspect-Judge) loop that identifies high-signal regions, performs targeted inspection while maintaining accumulated evidence across reasoning steps, and synthesizes a trajectory-level verdict. We evaluate TRACE on ten task domains from SHADE-Arena against state-of-the-art baselines. TRACE achieves an aggregate F1 of 0.713 and recall of 0.844, with the largest gains on tasks requiring long-range evidence linking.
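The control flow of the Triage-Inspect-Judge loop described above can be illustrated with a minimal sketch. This is not the TRACE implementation: the keyword scorer, the corroboration bonus, the `top_k` budget, the decision threshold, and every function and class name here are hypothetical stand-ins chosen only to show how evidence accumulated at an early step can change the assessment of a temporally distant one.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical toy signal; a real monitor would query an LLM instead.
SUSPICIOUS = {"api_key": 0.4, "external": 0.4, "delete": 0.3}


def keyword_score(step: str) -> float:
    """Toy per-step suspicion score in [0, 1] based on keyword matches."""
    text = step.lower()
    return min(1.0, sum(w for k, w in SUSPICIOUS.items() if k in text))


@dataclass
class Evidence:
    step_index: int
    weight: float


@dataclass
class MonitorState:
    # Evidence persists across inspection steps (assumed design).
    evidence: List[Evidence] = field(default_factory=list)


def triage(trajectory: List[str], score: Callable[[str], float], top_k: int) -> List[int]:
    """Pick the top_k highest-scoring steps, returned in trajectory order."""
    ranked = sorted(range(len(trajectory)), key=lambda i: score(trajectory[i]), reverse=True)
    return sorted(ranked[:top_k])


def inspect(trajectory: List[str], index: int, state: MonitorState,
            score: Callable[[str], float], bonus: float = 0.2) -> None:
    """Score one step, raising its weight if earlier evidence already exists."""
    weight = score(trajectory[index])
    if weight == 0.0:
        return  # nothing to record for a step with no signal
    if state.evidence:
        weight = min(1.0, weight + bonus)  # assumed corroboration rule
    state.evidence.append(Evidence(index, weight))


def judge(state: MonitorState, threshold: float) -> bool:
    """Trajectory-level verdict: flag if any linked evidence crosses the threshold."""
    return any(e.weight >= threshold for e in state.evidence)


def monitor(trajectory: List[str], top_k: int = 2, threshold: float = 0.5) -> bool:
    state = MonitorState()
    for i in triage(trajectory, keyword_score, top_k):
        inspect(trajectory, i, state, keyword_score)
    return judge(state, threshold)


if __name__ == "__main__":
    linked = [
        "read config file containing api_key",
        "summarize quarterly report",
        "format table",
        "send summary to external address",
    ]
    print(monitor(linked))
    print(monitor(linked[:3]))
```

In this toy setup neither suspicious step crosses the threshold on its own; the trajectory is flagged only because the second inspection sees the evidence left by the first, which is the long-range linking behavior the abstract attributes to the loop.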