As LLM agents begin to take real, irreversible actions (shell commands, file edits, deploys), the standard safety pattern is a human-in-the-loop approval gate: risky actions pause and wait for a person. We argue that the gate is the easy part. The hard part is the judgment of which actions to stop, and the field evaluates that judgment under two false assumptions: that there is a ground-truth notion of "risky," and that the human reviewer is a perfect, infinitely available oracle. On a hand-labeled set of 125 adversarially weighted agent actions, we show that (i) reviewers agree only moderately on what is risky (Fleiss' kappa = 0.52), so there is no single correct label; (ii) framing the guard as selective classification under asymmetric cost makes its operating limits measurable, and on hard inputs the guard cannot safely auto-decide; and (iii) when the reviewer is modeled as endogenous (fatiguing as escalation load grows), realized safety becomes an inverted-U in the escalation rate: more human oversight can make a system less safe, and the safety-optimal guard escalates at a rate below full escalation. A load-aware policy uses the same setting to resist a flooding attack that slips a malicious action past a fatigued reviewer. Framed this way, agent oversight is not only a classification problem but also a resource-allocation one: human attention is finite, and the guard's escalation policy spends it. We claim none of these mechanisms as novel: fatigue-aware learning-to-defer (FALCON), cost-sensitive deferral under workload constraints (DeCCaF), trajectory-level guarding, and reviewer-fatigue/flooding attacks are all prior art that we cite. Our contribution is an open-source agent-oversight system that operationalizes and measures these mechanisms in the LLM-agent action-gating setting, turning "is my guard good?" from a guess into a curve. The inverted-U and the flooding attack are modeling results; they motivate a human study.
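To make the inverted-U in point (iii) concrete, the following is a minimal sketch under assumed functional forms, not the model used in this work. It assumes that the guard ranks actions by risk, so the share of harmful actions captured at escalation rate r is 1 - (1 - r)^s, and that reviewer accuracy decays exponentially with load. The function names and the parameters `sharpness`, `base`, and `fatigue` are hypothetical and chosen only for illustration.

```python
import math


def harmful_coverage(rate: float, sharpness: float = 4.0) -> float:
    """Share of harmful actions that land in the escalated set (assumed form).

    A guard that ranks by risk captures harmful actions faster than random
    sampling would, hence the concave shape for sharpness > 1.
    """
    return 1.0 - (1.0 - rate) ** sharpness


def reviewer_accuracy(rate: float, base: float = 0.95, fatigue: float = 1.5) -> float:
    """Probability that the reviewer blocks an escalated harmful action.

    Assumed exponential decay: the more actions are escalated, the more
    fatigued (and less accurate) the reviewer becomes.
    """
    return base * math.exp(-fatigue * rate)


def realized_safety(rate: float) -> float:
    """Share of harmful actions actually stopped at a given escalation rate.

    Harmful actions that are not escalated are assumed to pass unchecked.
    """
    return harmful_coverage(rate) * reviewer_accuracy(rate)


def safety_curve(steps: int = 100) -> list[tuple[float, float]]:
    """Sample realized safety on an evenly spaced grid of escalation rates."""
    return [(i / steps, realized_safety(i / steps)) for i in range(steps + 1)]


def best_rate(steps: int = 100) -> float:
    """Escalation rate on the grid that maximizes realized safety."""
    return max(safety_curve(steps), key=lambda point: point[1])[0]


if __name__ == "__main__":
    peak = best_rate()
    print(f"safety-optimal escalation rate: {peak:.2f}")
    print(f"realized safety at the peak:    {realized_safety(peak):.3f}")
    print(f"realized safety at full load:   {realized_safety(1.0):.3f}")
```

Under these assumed parameters the curve rises, peaks at an interior escalation rate, and then falls as fatigue outweighs the extra coverage. The location of the peak depends entirely on the assumed fatigue and coverage forms, which is why a human study is needed to fit them.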