Beyond Task Completion: Revealing Corrupt Success in LLM Agents through Procedure-Aware Evaluation

Large Language Model (LLM)-based agents are increasingly adopted in high-stakes settings, but current benchmarks mainly evaluate whether a task was completed, not how it was completed. We introduce Procedure-Aware Evaluation (PAE), a framework that formalizes agent procedures as structured observations and exposes consistency relationships between what agents observe, communicate, and execute. PAE evaluates agents along four complementary axes (Utility, Efficiency, Interaction Quality, and Procedural Integrity) and applies multi-dimensional gating that categorically disqualifies corrupt outcomes. Evaluating state-of-the-art LLM agents on tau-bench yields findings at the axis, compliance, and benchmark levels. At the axis level, the dimensions capture non-redundant failure modes: utility masks reliability gaps, speed does not imply precision, and conciseness does not predict intent adherence. At the procedural compliance level, 27-78% of benchmark-reported successes are corrupt successes that conceal violations of interaction quality and procedural integrity. Furthermore, gating substantially reduces the Pass^4 rate and changes model rankings. Analysis of the corrupt success cases reveals distinctive per-model failure signatures: GPT-5 spreads errors across policy, execution, and intent dimensions; Kimi-K2-Thinking concentrates 78% of violations in policy faithfulness and compliance; and Mistral-Large-3 is dominated by faithfulness failures. At the benchmark level, our analysis exposes structural flaws in the benchmark design, including task scope gaps, contradictory reward signals, and simulator artifacts that produce accidental successes.
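
To make the gating idea concrete, the following is a minimal sketch and not the paper's implementation. It assumes that gating is a simple conjunction: a trial counts as a gated success only if the benchmark reports completion and no interaction or integrity violation was flagged. The `Trial` record, its field names, and the helper functions are hypothetical. The Pass^k estimate uses the combinatorial form commonly associated with tau-bench (the mean over tasks of C(c, k) / C(n, k), with c successes in n trials); whether PAE computes it in exactly this way is an assumption.

```python
from dataclasses import dataclass
from math import comb
from typing import List


@dataclass(frozen=True)
class Trial:
    """One agent run on one task (hypothetical record, not the paper's schema)."""
    task_completed: bool   # success as reported by the benchmark reward
    interaction_ok: bool   # assumed flag: no interaction-quality violation
    integrity_ok: bool     # assumed flag: no procedural-integrity violation


def gated_success(t: Trial) -> bool:
    # Assumed gate: completion counts only if every procedural check passes.
    return t.task_completed and t.interaction_ok and t.integrity_ok


def is_corrupt_success(t: Trial) -> bool:
    # Reported as a success by the benchmark, but disqualified by the gate.
    return t.task_completed and not gated_success(t)


def pass_hat_k(outcomes_per_task: List[List[bool]], k: int) -> float:
    """Mean over tasks of C(c, k) / C(n, k), with c successes in n trials."""
    total = 0.0
    for outcomes in outcomes_per_task:
        n, c = len(outcomes), sum(outcomes)
        if n < k:
            raise ValueError("each task needs at least k trials")
        total += comb(c, k) / comb(n, k)  # comb(c, k) is 0 when c < k
    return total / len(outcomes_per_task)


# Toy data: two tasks, four trials each; every trial completes the task,
# but one trial of the first task carries an integrity violation.
clean = Trial(True, True, True)
corrupt = Trial(True, True, False)
tasks = [[clean, clean, clean, corrupt], [clean, clean, clean, clean]]

ungated = pass_hat_k([[t.task_completed for t in task] for task in tasks], k=4)
gated = pass_hat_k([[gated_success(t) for t in task] for task in tasks], k=4)

all_trials = [t for task in tasks for t in task]
reported = [t for t in all_trials if t.task_completed]
corrupt_rate = sum(is_corrupt_success(t) for t in reported) / len(reported)
```

In this toy data a single corrupt trial out of eight reported successes halves the Pass^4 estimate, because Pass^k requires all k sampled trials of a task to succeed. This illustrates why gating can reduce Pass^4 far more than the raw corrupt-success rate alone would suggest.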
