The transition from Cloud-Native to AI-Native architectures is fundamentally reshaping software engineering, replacing deterministic microservices with probabilistic agentic services. However, this shift renders traditional black-box evaluation paradigms insufficient: existing benchmarks measure raw model capabilities while remaining blind to system-level execution dynamics. To bridge this gap, we introduce AI-NativeBench, the first application-centric, white-box AI-Native benchmark suite grounded in the Model Context Protocol (MCP) and Agent-to-Agent (A2A) standards. By treating agentic spans as first-class citizens within distributed traces, our methodology enables granular analysis of engineering characteristics beyond raw capability scores. Applying this benchmark to 21 system variants, we uncover critical engineering realities invisible to traditional metrics: a parameter paradox, where lightweight models often surpass flagships in protocol adherence; a pervasive inference dominance, which renders protocol overhead secondary; and an expensive failure pattern, where self-healing mechanisms paradoxically act as cost multipliers on unviable workflows. This work provides the first systematic evidence to guide the transition from measuring model capability to engineering reliable AI-Native systems. To facilitate reproducibility and further research, we have open-sourced the benchmark and dataset.
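To make the idea of agentic spans as first-class trace citizens concrete, the following is a minimal sketch using the OpenTelemetry Python SDK; the span names and attributes (`ai.agent.name`, `mcp.tool.name`, `a2a.peer_agent`, and so on) are illustrative assumptions for this sketch, not the actual AI-NativeBench instrumentation or schema.

```python
# Minimal sketch (requires: opentelemetry-sdk): recording an agent task, its
# LLM inference step, and an MCP tool call as separate spans in one trace, so
# inference time and protocol overhead can be attributed independently.
# Attribute names below are illustrative assumptions, not the benchmark's schema.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor, ConsoleSpanExporter

# Export spans to the console so the example is self-contained and runnable.
trace.set_tracer_provider(TracerProvider())
trace.get_tracer_provider().add_span_processor(
    SimpleSpanProcessor(ConsoleSpanExporter())
)
tracer = trace.get_tracer("ai_native_bench.demo")  # hypothetical tracer name

# Parent span for one agent task, with child spans for inference and tool use.
with tracer.start_as_current_span("agent.task") as task_span:
    task_span.set_attribute("ai.agent.name", "planner")           # assumed attribute
    task_span.set_attribute("a2a.peer_agent", "executor")         # assumed attribute

    with tracer.start_as_current_span("agent.llm_inference") as llm_span:
        llm_span.set_attribute("ai.model.name", "example-model")  # assumed attribute
        # ... model call would happen here ...

    with tracer.start_as_current_span("agent.mcp_tool_call") as tool_span:
        tool_span.set_attribute("mcp.tool.name", "search_docs")   # assumed attribute
        # ... MCP tool invocation would happen here ...
```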