信息获取如何影响LLM监控器检测破坏行为的能力？ (How does information access affect LLM monitors' ability to detect sabotage?)

Frontier language model agents can exhibit misaligned behaviors, including deception, exploiting reward hacks, and pursuing hidden objectives. To control potentially misaligned agents, we can use LLMs themselves to monitor for misbehavior. In this paper, we study how information access affects LLM monitor performance. While one might expect that monitors perform better when they have access to more of the monitored agents' reasoning and actions, we find that contemporary systems often perform better with less information, a phenomenon we call the less-is-more effect for automated oversight. We demonstrate this phenomenon, and analyze the conditions under which it occurs, in three evaluation environments where agents must conduct sabotage while evading monitors. Motivated by the less-is-more effect, we introduce extract-and-evaluate (EaE) monitoring--a new hierarchical approach where one monitor isolates relevant excerpts from the monitored agent's trajectory and a separate monitor scores them. In BigCodeBench-Sabotage with GPT-4.1-mini as the monitor model, EaE improves sabotage detection rates by 16.8 percentage points over the next-best monitor without increasing the false positive rate. In other settings, EaE either outperforms or is competitive with baselines. In addition, we find that agents unaware of being monitored can be caught much more easily and that monitors scoring longer reasoning traces gain more from information filtering. Lastly, we conduct a cost-performance analysis and find that Gemini and Claude models with monitoring techniques that involve information filtering occupy much of the Pareto frontier.

翻译：前沿语言模型智能体可能表现出未对齐行为，包括欺骗、利用奖励漏洞及追求隐藏目标。为控制潜在未对齐的智能体，我们可以使用LLM自身来监控不当行为。本文研究信息获取如何影响LLM监控器的性能。尽管人们可能预期监控器在获取更多被监控智能体的推理过程和行为信息时表现更佳，但我们发现当代系统通常在信息较少时表现更好，这一现象我们称之为自动化监督的"少即是多"效应。我们在三个评估环境中论证了该现象并分析其发生条件，这些环境要求智能体在实施破坏行为的同时规避监控。受"少即是多"效应启发，我们提出提取-评估监控法——一种新型分层监控方法：首个监控器从被监控智能体的行为轨迹中隔离相关片段，再由独立监控器对其进行评分。在BigCodeBench-Sabotage环境中使用GPT-4.1-mini作为监控模型时，EaE方法将破坏行为检测率较次优监控器提升16.8个百分点，且未增加误报率。在其他实验场景中，EaE方法均优于或与基线方法持平。此外，我们发现未察觉被监控的智能体更容易被捕获，且对较长推理轨迹进行评分的监控器更能从信息过滤中获益。最后通过成本-性能分析发现，采用信息过滤技术的Gemini和Claude模型在帕累托前沿中占据主导地位。