To identify safety violations, auditors often need to search large collections of agent traces. This search is difficult because failures are typically rare and complex, and are sometimes adversarially hidden so that they are detectable only when multiple traces are analyzed together. These challenges arise in diverse settings such as misuse campaigns, covert sabotage, reward hacking, and prompt injection. Existing approaches struggle here for several reasons: per-trace judges miss failures that become visible only across traces, naive agentic auditing does not scale to large trace collections, and fixed monitors are brittle to unanticipated behaviors. We introduce Meerkat, which combines clustering with agentic search to uncover violations specified in natural language. Through structured search and adaptive investigation of promising regions, Meerkat finds sparse failures without relying on seed scenarios, fixed workflows, or exhaustive enumeration. Across misuse, misalignment, and task gaming settings, Meerkat significantly improves detection of safety violations over baseline monitors, discovers widespread developer cheating on a top agent benchmark, and finds nearly 4x more examples of reward hacking on CyBench than previous audits.
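The cross-trace point can be made concrete with a toy sketch. This is not Meerkat's implementation: the abstract does not specify the clustering method or the agent's search procedure, so the greedy Jaccard clustering, the `STAGES` set, both judges, and the sample traces below are hypothetical stand-ins chosen only to show how a violation split across several traces can evade a per-trace judge yet surface once similar traces are grouped.

```python
# Toy illustration only; every name here is an assumption, not Meerkat's API.

def tokens(trace):
    """Lowercased whitespace tokens of a trace, as a set."""
    return frozenset(trace.lower().split())

def jaccard(a, b):
    """Jaccard similarity of two token sets (1.0 if both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def greedy_cluster(traces, threshold):
    """Put each trace in the first cluster whose representative
    (its first member) is at least `threshold` similar; otherwise
    start a new cluster. Returns lists of trace indices."""
    clusters = []
    for i, trace in enumerate(traces):
        for cluster in clusters:
            if jaccard(tokens(trace), tokens(traces[cluster[0]])) >= threshold:
                cluster.append(i)
                break
        else:
            clusters.append([i])
    return clusters

# Hypothetical violation: a multi-stage attack spread over separate traces.
STAGES = {"scan", "exploit", "exfiltrate"}

def per_trace_judge(trace):
    """Flags a trace only if it alone contains every stage."""
    return STAGES <= tokens(trace)

def cluster_judge(members):
    """Flags a group if its traces jointly contain every stage."""
    seen = set().union(*(tokens(m) for m in members))
    return STAGES <= seen

traces = [
    "user42 asks to scan host alpha",
    "user7 asks for a pasta recipe",
    "user42 asks to exploit host alpha",
    "user9 asks for a cake recipe",
    "user42 asks to exfiltrate host alpha",
]

clusters = greedy_cluster(traces, threshold=0.4)
flagged = [c for c in clusters if cluster_judge([traces[i] for i in c])]
```

In this toy setting no single trace trips `per_trace_judge`, while the cluster containing the three `user42` traces is flagged. A real auditor would replace both judges with model-based investigation and would need a far more robust notion of similarity than token overlap.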