We introduce AutoMonitor-Bench, the first benchmark designed to systematically evaluate the reliability of LLM-based misbehavior monitors across diverse tasks and failure modes. AutoMonitor-Bench consists of 3,010 carefully annotated test samples spanning question answering, code generation, and reasoning, with paired misbehavior and benign instances. We evaluate monitors using two complementary metrics: Miss Rate (MR) and False Alarm Rate (FAR), capturing failures to detect misbehavior and oversensitivity to benign behavior, respectively. Evaluating 12 proprietary and 10 open-source LLMs, we observe substantial variability in monitoring performance and a consistent trade-off between MR and FAR, revealing an inherent safety-utility tension. To further probe the limits of monitor reliability, we construct a large-scale training corpus of 153,581 samples and fine-tune Qwen3-4B-Instruct to investigate whether training on known, relatively easy-to-construct misbehavior datasets improves monitoring performance on unseen and more implicit misbehaviors. Our results highlight the challenges of reliable, scalable misbehavior monitoring and motivate future work on task-aware design and training strategies for LLM-based monitors.
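As a reading aid, the following minimal sketch (not the paper's evaluation code) illustrates how MR and FAR could be computed from a monitor's binary verdicts, assuming the natural definitions implied above: MR as the fraction of misbehavior instances the monitor fails to flag, and FAR as the fraction of benign instances it incorrectly flags. The labeling convention and function names are illustrative assumptions.

```python
# Illustrative sketch of the two metrics described in the abstract.
# Assumed convention: label 1 = misbehavior, 0 = benign; verdict 1 = flagged by the monitor.

def miss_rate(labels, verdicts):
    """Fraction of true misbehavior instances the monitor fails to flag (MR)."""
    misbehavior = [(l, v) for l, v in zip(labels, verdicts) if l == 1]
    if not misbehavior:
        return 0.0
    return sum(1 for _, v in misbehavior if v == 0) / len(misbehavior)

def false_alarm_rate(labels, verdicts):
    """Fraction of benign instances the monitor incorrectly flags (FAR)."""
    benign = [(l, v) for l, v in zip(labels, verdicts) if l == 0]
    if not benign:
        return 0.0
    return sum(1 for _, v in benign if v == 1) / len(benign)

# Example with two paired misbehavior/benign instances:
labels   = [1, 1, 0, 0]
verdicts = [1, 0, 0, 1]   # one missed misbehavior, one false alarm on a benign case
print(miss_rate(labels, verdicts))         # 0.5
print(false_alarm_rate(labels, verdicts))  # 0.5
```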