How does information access affect LLM monitors' ability to detect sabotage?

Frontier language model agents can exhibit misaligned behaviors, including deception, exploiting reward hacks, and pursuing hidden objectives. To control potentially misaligned agents, we can use LLMs themselves to monitor for misbehavior. In this paper, we study how information access affects LLM monitor performance. While one might expect that monitors perform better when they have access to more of the monitored agents' reasoning and actions, we find that contemporary systems often perform better with less information, a phenomenon we call the less-is-more effect for automated oversight. We demonstrate this phenomenon, and analyze the conditions under which it occurs, in three evaluation environments where agents must conduct sabotage while evading monitors. Leveraging this effect, we introduce extract-and-evaluate (EaE) monitoring -- a new hierarchical approach where one monitor isolates relevant excerpts from the monitored agent's trajectory and another monitor scores these excerpts. In BigCodeBench-Sabotage with a GPT-4.1-mini monitor, EaE improves sabotage detection rates by 16.8 percentage points over the next-best approach. In other settings, EaE either outperforms or is competitive with baselines. In addition, we find that agents unaware of being monitored can be caught much more easily and that monitors scoring longer reasoning traces gain more from information filtering. Lastly, we conduct a cost-performance analysis and find that Gemini and Claude models with monitoring techniques that involve information filtering occupy much of the Pareto frontier.

翻译：前沿语言模型智能体可能表现出未对齐的行为，包括欺骗、利用奖励漏洞和追求隐藏目标。为控制潜在未对齐的智能体，我们可以使用LLM自身来监控不当行为。本文研究信息获取如何影响LLM监控器的性能。尽管人们可能预期监控器在获取更多被监控智能体的推理过程和行动信息时表现更好，但我们发现当代系统通常在信息较少时表现更优，我们将这一现象称为自动化监督的"少即是多"效应。我们在三个评估环境中证明了这一现象并分析了其发生条件，这些环境要求智能体在实施破坏行为的同时规避监控。利用此效应，我们提出了提取-评估监控——一种新的分层方法，其中一个监控器从被监控智能体的轨迹中隔离相关片段，另一监控器对这些片段进行评分。在BigCodeBench-Sabotage环境中使用GPT-4.1-mini监控器时，EaE将破坏行为检测率较次优方法提升了16.8个百分点。在其他设定中，EaE要么优于基线方法，要么与基线表现相当。此外，我们发现未察觉被监控的智能体更容易被捕获，且对较长推理轨迹进行评分的监控器能从信息过滤中获得更大收益。最后，我们进行了成本-性能分析，发现采用信息过滤监控技术的Gemini和Claude模型占据了帕累托前沿的大部分区域。