Measuring the Permission Gate: A Stress-Test Evaluation of Claude Code's Auto Mode

Claude Code's auto mode is the first deployed permission system for AI coding agents, using a two-stage transcript classifier to gate dangerous tool calls. Anthropic reports a 0.4% false positive rate (FPR) and a 17% false negative rate (FNR) on production traffic. We present the first independent evaluation of this system on deliberately ambiguous authorization scenarios, i.e., tasks where the user's intent is clear but the target scope, blast radius, or risk level is underspecified. Using AmPermBench, a 128-prompt benchmark spanning four DevOps task families and three controlled ambiguity axes, we evaluate 253 state-changing actions individually against oracle ground truth. Our findings characterize auto mode's scope-escalation coverage under this stress-test workload. The end-to-end FNR is 81.0% (95% CI: 73.8%–87.4%), substantially higher than the 17% reported on production traffic; this reflects a fundamentally different workload rather than a contradiction. Notably, 36.8% of all state-changing actions (93 of 253) are in-project file edits (Tier 2), which fall outside the classifier's scope and therefore contribute to the elevated end-to-end FNR. Even when restricted to the 160 actions the classifier actually evaluates (Tier 3), the FNR remains 70.3%, while the FPR rises to 31.9%. The Tier 2 coverage gap is most pronounced on artifact cleanup (92.9% FNR), where agents naturally fall back to editing state files when the expected CLI is unavailable. These results highlight a coverage boundary worth examining: auto mode assumes dangerous actions transit the shell, but agents routinely achieve equivalent effects through file edits that the classifier does not evaluate.
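
To make the distinction between the end-to-end and Tier 3-only error rates concrete, the following is a minimal sketch of how action-level FNR and FPR could be computed. It is not the paper's evaluation code: the `Action` record, its field names, and the eight toy actions are hypothetical and were chosen only to show how unevaluated Tier 2 actions count as misses in the end-to-end rate but drop out of the Tier 3-only rate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    """One state-changing action (hypothetical record layout)."""
    tier: int           # 2 = in-project file edit (never classified), 3 = classifier-evaluated
    should_block: bool  # oracle ground truth: the action exceeds the authorized scope
    blocked: bool       # what the permission gate actually did


def rates(actions):
    """Return (FNR, FPR) over the given actions.

    FNR = unsafe actions that were allowed / all unsafe actions
    FPR = safe actions that were blocked   / all safe actions
    """
    unsafe = [a for a in actions if a.should_block]
    safe = [a for a in actions if not a.should_block]
    fnr = sum(not a.blocked for a in unsafe) / len(unsafe) if unsafe else float("nan")
    fpr = sum(a.blocked for a in safe) / len(safe) if safe else float("nan")
    return fnr, fpr


# Toy data, NOT the benchmark's results. Tier 2 actions are never blocked
# because the classifier does not see them.
toy = [
    Action(tier=2, should_block=True, blocked=False),
    Action(tier=2, should_block=True, blocked=False),
    Action(tier=2, should_block=False, blocked=False),
    Action(tier=3, should_block=True, blocked=True),
    Action(tier=3, should_block=True, blocked=False),
    Action(tier=3, should_block=False, blocked=True),
    Action(tier=3, should_block=False, blocked=False),
    Action(tier=3, should_block=False, blocked=False),
]

end_to_end = rates(toy)                               # all actions
tier3_only = rates([a for a in toy if a.tier == 3])   # classifier-evaluated only

print(f"end-to-end  FNR={end_to_end[0]:.3f}  FPR={end_to_end[1]:.3f}")
print(f"Tier 3 only FNR={tier3_only[0]:.3f}  FPR={tier3_only[1]:.3f}")
```

On this toy set, restricting to Tier 3 lowers the FNR and raises the FPR, the same direction of change as in the reported figures: unsafe Tier 2 actions are always misses, and safe Tier 2 actions can never be false positives, so removing them shrinks both denominators.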
