Characterizing WebGPU Dispatch Overhead for LLM Inference Across Four GPU Vendors, Three Backends, and Three Browsers

WebGPU's security-focused design imposes per-operation validation that compounds across the many small dispatches in neural network inference, yet the true cost of this overhead is poorly characterized. We present a systematic characterization of WebGPU dispatch overhead for LLM inference at batch size 1, spanning four GPU vendors (NVIDIA, AMD, Apple, Intel), two native implementations (Dawn, wgpu-native), three browsers (Chrome, Safari, Firefox), and two model sizes (Qwen2.5-0.5B and 1.5B). Our primary contribution is a sequential-dispatch methodology, which reveals that naive single-operation benchmarks overestimate dispatch cost by ${\sim}20\times$. The true per-dispatch cost of WebGPU API overhead alone is 24--36~$μ$s on Vulkan and 32--71~$μ$s on Metal, while the total per-operation overhead including Python cost is ${\sim}95$~$μ$s; this distinction is critical for optimization. On Vulkan, kernel fusion improves throughput by 53%, whereas fusion on CUDA provides no benefit, confirming that per-operation overhead is a primary differentiator. LLM inference was tested on three major operating systems (Linux, Windows, macOS). We built $\texttt{torch-webgpu}$, a PrivateUse1-based out-of-tree PyTorch backend and FX-to-WebGPU compiler, which achieves 11--12% of CUDA performance on our reference platform. At dtype-matched float32, an RTX PRO 2000 achieves 1.4$\times$ WebGPU's throughput despite having ${\sim}6\times$ less compute than the RTX 5090. For dispatch overhead, backend choice is the dominant factor, although implementation choice also matters substantially within a backend (2.2$\times$ for Metal). Comparing dispatch overhead with kernel compute efficiency, we conclude that at batch size 1 with the current dispatch-heavy pipeline, per-operation overhead dominates regardless of kernel quality. All code, benchmarks, and raw data are open source.

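The contrast between a naive single-operation benchmark and sequential-dispatch timing can be illustrated with a minimal sketch. It is not the paper's benchmark code. It assumes, as one plausible mechanism, that the naive benchmark pays a blocking synchronization after every dispatch, whereas sequential timing pays it once and amortizes it over many dispatches. `FakeQueue`, `naive_single_op`, `sequential`, and all microsecond costs are hypothetical, chosen only to make the gap visible; none of them are measured values.

```python
from dataclasses import dataclass


@dataclass
class FakeQueue:
    """Toy stand-in for a GPU queue. All costs are made-up, not measurements."""
    dispatch_us: float = 30.0   # hypothetical API cost of recording + submitting one dispatch
    sync_us: float = 600.0      # hypothetical cost of one blocking GPU-to-CPU sync
    clock_us: float = 0.0       # simulated wall clock, in microseconds

    def dispatch(self) -> None:
        self.clock_us += self.dispatch_us

    def sync(self) -> None:
        self.clock_us += self.sync_us


def naive_single_op(queue: FakeQueue, repeats: int = 100) -> float:
    """Sync after every dispatch, so the sync cost is charged to each one."""
    start = queue.clock_us
    for _ in range(repeats):
        queue.dispatch()
        queue.sync()
    return (queue.clock_us - start) / repeats


def sequential(queue: FakeQueue, n: int = 1000) -> float:
    """Issue n back-to-back dispatches and sync once at the end."""
    start = queue.clock_us
    for _ in range(n):
        queue.dispatch()
    queue.sync()
    return (queue.clock_us - start) / n


naive_cost = naive_single_op(FakeQueue())
sequential_cost = sequential(FakeQueue())
print(f"naive: {naive_cost:.1f} us/dispatch, sequential: {sequential_cost:.1f} us/dispatch")
```

On real hardware the simulated clock would be replaced by a wall-clock timer around actual dispatch and synchronization calls. The structural point is only that the sequential estimate approaches the per-dispatch cost as `n` grows, while the naive estimate stays inflated by the per-operation sync.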