The transition from Cloud-Native to AI-Native architectures is fundamentally reshaping software engineering, replacing deterministic microservices with probabilistic agentic services. However, this shift renders traditional black-box evaluation paradigms insufficient: existing benchmarks measure raw model capabilities while remaining blind to system-level execution dynamics. To bridge this gap, we introduce AI-NativeBench, the first application-centric, white-box AI-Native benchmark suite grounded in the Model Context Protocol (MCP) and Agent-to-Agent (A2A) standards. By treating agentic spans as first-class citizens within distributed traces, our methodology enables granular analysis of engineering characteristics beyond raw capability scores. Applying this benchmark to 21 system variants, we uncover critical engineering realities invisible to traditional metrics: a parameter paradox, where lightweight models often surpass flagships in protocol adherence; a pervasive inference dominance, which renders protocol overhead secondary; and an expensive failure pattern, where self-healing mechanisms paradoxically act as cost multipliers on unviable workflows. This work provides the first systematic evidence to guide the transition from measuring model capability to engineering reliable AI-Native systems. To facilitate reproducibility and further research, we have open-sourced the benchmark and dataset.
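To make the idea of agentic spans as first-class trace citizens concrete, the following is a minimal sketch using the OpenTelemetry Python SDK; the span names and attributes (`ai.agent.name`, `mcp.tool.name`, `a2a.peer_agent`, and so on) are illustrative assumptions for this sketch, not the actual AI-NativeBench instrumentation or schema.

```python
# Minimal sketch (requires: opentelemetry-sdk): recording an agent task, its
# LLM inference step, and an MCP tool call as separate spans in one trace, so
# inference time and protocol overhead can be attributed independently.
# Attribute names below are illustrative assumptions, not the benchmark's schema.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor, ConsoleSpanExporter

# Export spans to the console so the example is self-contained and runnable.
trace.set_tracer_provider(TracerProvider())
trace.get_tracer_provider().add_span_processor(
    SimpleSpanProcessor(ConsoleSpanExporter())
)
tracer = trace.get_tracer("ai_native_bench.demo")  # hypothetical tracer name

# Parent span for one agent task, with child spans for inference and tool use.
with tracer.start_as_current_span("agent.task") as task_span:
    task_span.set_attribute("ai.agent.name", "planner")           # assumed attribute
    task_span.set_attribute("a2a.peer_agent", "executor")         # assumed attribute

    with tracer.start_as_current_span("agent.llm_inference") as llm_span:
        llm_span.set_attribute("ai.model.name", "example-model")  # assumed attribute
        # ... model call would happen here ...

    with tracer.start_as_current_span("agent.mcp_tool_call") as tool_span:
        tool_span.set_attribute("mcp.tool.name", "search_docs")   # assumed attribute
        # ... MCP tool invocation would happen here ...
```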