As coding agents move into production workflows, teams need to know not only whether an agent completes a task, but whether its actions can be trusted. We show that completion and trustworthiness diverge sharply and systematically. Across 1,750 trajectories on 50 SWE-bench Verified tasks, we compare four frontier models over repeated runs and separate submit rate (the agent submits a patch) from test-verified resolve rate (the patch passes the task's tests). GPT-5 submits a patch on 100% of runs but resolves only 44%; Llama 4 submits on 99% but resolves 18%; and Gemini, despite submitting least often at 70%, resolves more tasks than GPT-5 (50% versus 44%). These gaps are not random: they concentrate in one dangerous failure mode we call silent semantic failure. Qualitatively, on a buggy task the agent submits a plausible-looking patch on all five runs, yet none pass, which points to one repeated misinterpretation rather than random error. Quantitatively, silent semantic failure dominates failure, covering 80% of Llama 4's failing runs and 68% of GPT-5's, and it is invisible: the outcomes are confidently and consistently wrong, so completion-based and consistency-based monitoring both look healthy exactly when the agent should not be trusted. Lightweight pre-edit prompts do not close the gap. A second probe isolates the instinct to act: given an already-fixed bug, where the right move is to abstain, most models still edit the correct code. This action bias, acting when no action is warranted, is exactly what completion metrics reward. The throughline is measurement: submit rate captures action, but trust requires validity. Evaluation must therefore catch up: score agents by test-verified correctness over repeated runs, report the uncertainty of that score, and reward agents that know when not to act.
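To make the distinction between submit rate and resolve rate concrete, the following is a minimal sketch of how the two could be computed from per-run records, together with an uncertainty estimate. It is not the paper's evaluation code. The `Run` record, the `summarize` helper, the choice of a Wilson score interval, and the operational definition of silent failure used here (a failing run on a task where every run submitted a patch and none resolved) are all assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from math import sqrt


@dataclass(frozen=True)
class Run:
    """One agent trajectory (hypothetical record layout)."""
    task_id: str
    submitted: bool  # the agent produced a patch
    resolved: bool   # the patch passed the task's tests


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


def summarize(runs: list[Run]) -> dict:
    """Report submit rate, resolve rate with an interval, and silent-failure share."""
    n = len(runs)
    n_submitted = sum(r.submitted for r in runs)
    n_resolved = sum(r.resolved for r in runs)

    by_task: dict[str, list[Run]] = defaultdict(list)
    for r in runs:
        by_task[r.task_id].append(r)

    # Assumed definition: a task is "silently failing" when every run
    # submits a patch and no run passes the tests.
    silent_tasks = {
        task for task, rs in by_task.items()
        if all(r.submitted for r in rs) and not any(r.resolved for r in rs)
    }
    failing = [r for r in runs if not r.resolved]
    n_silent = sum(r.task_id in silent_tasks for r in failing)

    return {
        "submit_rate": n_submitted / n,
        "resolve_rate": n_resolved / n,
        "resolve_ci": wilson_interval(n_resolved, n),
        "silent_share_of_failures": n_silent / len(failing) if failing else 0.0,
    }
```

On toy data, a model can show a high submit rate while its resolve rate is low and its interval wide, which is the gap that completion-only monitoring would miss. Because this sketch pools runs across tasks, the interval treats runs as independent; repeated runs on the same task are likely correlated, so a per-task or bootstrap estimate may be more appropriate in practice.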