As AI agents expand into high-stakes domains such as network system operations, evaluating their real-world reliability becomes increasingly critical. However, existing benchmarks risk contamination due to their static design, exhibit high statistical variance from limited dataset sizes, and fail to reflect the complexity of production environments. We present NetArena, a dynamic benchmark generation framework for network applications. NetArena introduces a novel abstraction and unified interface that generalize across diverse tasks, enabling dynamic benchmarking despite the heterogeneity of network workloads. At runtime, users can generate unlimited queries on demand. NetArena integrates with network emulators to measure correctness, safety, and latency during execution. We demonstrate NetArena on three representative applications and find that (1) NetArena significantly improves statistical reliability across AI agents, reducing confidence-interval overlap from 85% to zero, (2) agents achieve only 13-38% average performance (as low as 3%) on large-scale, realistic queries, and (3) NetArena exposes fine-grained behaviors that static, correctness-only benchmarks miss. NetArena also enables use cases such as supervised fine-tuning (SFT) and reinforcement learning (RL) fine-tuning on network system tasks. Code is available at https://github.com/Froot-NetSys/NetArena.