WirelessBench: A Tolerance-Aware LLM Agent Benchmark for Wireless Network Intelligence

LLM agents are emerging as a key enabler of autonomous wireless network management. Deploying them reliably, however, demands benchmarks that reflect real engineering risk. Existing wireless benchmarks evaluate isolated capabilities one at a time and treat all errors uniformly, missing both failures that cascade along a reasoning chain and catastrophic unit confusions (\textit{e.g.}, dB vs.\ dBm). We present \wb{}, the first tolerance-aware, tool-integrated benchmark for LLM-based wireless agents. \wb{} is organized as a three-tier cognitive hierarchy: domain knowledge reasoning (WCHW, 1{,}392 items), intent-driven resource allocation (WCNS, 1{,}000 items), and proactive multi-step decisions under mobility (WCMSA, 1{,}000 items). \wb{} rests on three design principles: \emph{(i)}~tolerance-aware scoring with catastrophic-error detection; \emph{(ii)}~tool-necessary tasks that require a 3GPP-compliant ray-tracing query to obtain channel quality; and \emph{(iii)}~Chain-of-Thought (CoT)-traceable items, where every benchmark item ships with a complete CoT trajectory, enabling fine-grained diagnosis of where in the reasoning chain an agent fails. Our results show that a direct-prompting model (GPT-4o) scores $68\%$, trailing a tool-integrated agent ($84.64\%$) by $16.64$\,pp; $23\%$ of errors are catastrophic failures invisible to exact-match metrics. More importantly, the hierarchy decomposes errors into four actionable diagnostic categories that flat evaluation cannot reveal. Code and data: https://wirelessbench.github.io/.
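
To make the idea of tolerance-aware scoring with catastrophic-error detection concrete, the following is a minimal sketch and not the benchmark's actual scorer. The function names, the 5\% relative tolerance, the factor-of-ten catastrophic threshold, and the three labels are all assumptions chosen for illustration. The unit confusion is illustrated by reading a dBm value as dBW, which shifts the result by 30 dB (a factor of 1000 in linear power).

```python
def dbm_to_watts(p_dbm: float) -> float:
    """Convert power in dBm to watts (0 dBm = 1 mW)."""
    return 10 ** ((p_dbm - 30) / 10)


def dbw_to_watts(p_dbw: float) -> float:
    """Convert power in dBW to watts (0 dBW = 1 W)."""
    return 10 ** (p_dbw / 10)


def score_answer(predicted: float, reference: float,
                 rel_tol: float = 0.05,            # assumed tolerance
                 catastrophic_ratio: float = 10.0  # assumed threshold
                 ) -> tuple[float, str]:
    """Hypothetical tolerance-aware scorer.

    Returns (score, label), where label is one of
    "correct", "minor", or "catastrophic".
    """
    if reference == 0:
        raise ValueError("reference must be non-zero for relative scoring")

    # Accept any answer within the relative tolerance band.
    rel_err = abs(predicted - reference) / abs(reference)
    if rel_err <= rel_tol:
        return 1.0, "correct"

    # A zero answer or a sign flip is treated as catastrophic.
    if predicted == 0 or (predicted > 0) != (reference > 0):
        return 0.0, "catastrophic"

    # An order-of-magnitude miss in either direction is catastrophic.
    ratio = abs(predicted / reference)
    if ratio >= catastrophic_ratio or ratio <= 1 / catastrophic_ratio:
        return 0.0, "catastrophic"

    # Outside tolerance but the right order of magnitude.
    return 0.0, "minor"


if __name__ == "__main__":
    reference = dbm_to_watts(23)   # a 23 dBm transmit power in watts
    confused = dbw_to_watts(23)    # the same number misread as dBW
    print(score_answer(0.2, reference))
    print(score_answer(0.3, reference))
    print(score_answer(confused, reference))
```

Under these assumptions, a rounded answer of 0.2 W for a 23 dBm transmit power counts as correct, 0.3 W is a minor error, and the dBW misreading is flagged as catastrophic. An exact-match metric would mark all three simply as wrong (or, with string matching on the number 23, might not notice the unit error at all), which is the distinction the abstract draws.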