Benchmarks are Not Enough: RAMP for Runtime Assessing of Agentic Models in Production Systems

LLM agents are rapidly evolving from coding assistants into autonomous software engineering systems. However, existing evaluation methodologies remain largely centered on static, isolated, and short-horizon benchmarks that fail to capture the dynamic complexity of real-world production workflows. As a result, benchmark performance may poorly reflect practical capability under realistic runtime environments involving long execution chains, tool interactions, dependency management, and iterative feedback loops. We thus present RAMP, a production-grounded infrastructure for assessing long-horizon software engineering agents. Built upon the YatCC integrated platform, RAMP provides a unified runtime assessment architecture through standardized orchestration and execution interfaces. RAMP introduces realistic compiler-construction workloads with serial dependencies and complex toolchain interactions, together with a staged recovery mechanism for analyzing execution behavior under partial workflow failure. The framework further incorporates utility-oriented multi-dimensional metrics that jointly evaluate outcome quality and process efficiency. We conduct runtime assessments across 15 mainstream models and observe substantial capability degradation that remains largely invisible to conventional isolated benchmarks. Task completion rates progressively collapse across serial workflows, dropping from 100% in the initial stage to only 20% in the final stage, while none of the evaluated models successfully completes the entire pipeline. Runtime analysis reveals systematic failure propagation and significant resource inefficiencies, with computational costs differing by up to three orders of magnitude among comparable models. These findings suggest RAMP advances agentic model evaluation toward continuous, runtime-observable, and production-grounded assessment.
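
The contrast between serial completion and staged recovery can be illustrated with a minimal sketch. The abstract does not specify how RAMP's recovery mechanism works, so this assumes that recovery lets each stage be attempted from a known-good state even when an earlier stage failed. All names (`stage_completion_rates`, `cascade_failures`, `full_pipeline_completions`) and the toy results are hypothetical and are not taken from RAMP or YatCC.

```python
from typing import Dict, List

# Hypothetical per-model results: one boolean per pipeline stage, recorded
# under the ASSUMED staged-recovery setting (each stage is attempted from a
# known-good state, so a stage can pass even if an earlier one failed).
Results = Dict[str, List[bool]]


def stage_completion_rates(results: Results) -> List[float]:
    """Fraction of models that pass each stage, in stage order."""
    n_models = len(results)
    n_stages = len(next(iter(results.values())))
    return [
        sum(passed[s] for passed in results.values()) / n_models
        for s in range(n_stages)
    ]


def cascade_failures(results: Results) -> Results:
    """Simulate a run without recovery: once a stage fails,
    every later stage of that model counts as failed too."""
    cascaded: Results = {}
    for model, passed in results.items():
        still_ok = True
        row = []
        for stage_passed in passed:
            still_ok = still_ok and stage_passed
            row.append(still_ok)
        cascaded[model] = row
    return cascaded


def full_pipeline_completions(results: Results) -> List[str]:
    """Models that pass every stage of the serial pipeline."""
    return [model for model, passed in results.items() if all(passed)]


# Illustrative toy data only; these are not measurements from the paper.
toy: Results = {
    "model_a": [True, True, False, True],
    "model_b": [True, False, False, False],
    "model_c": [True, True, True, False],
}

with_recovery = stage_completion_rates(toy)
without_recovery = stage_completion_rates(cascade_failures(toy))
completed = full_pipeline_completions(toy)
```

In this toy data, `model_a` fails the third stage but passes the fourth once recovery is applied, so the two per-stage rate lists differ at the final stage. Separating the two views is one way to tell propagated failures apart from failures intrinsic to a stage.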
