Running language models in the browser presents a unique opportunity to build efficient, private, and portable AI applications, but requires contending with constrained memory availability and heterogeneous hardware targets. To realize this opportunity, we present Llamas on the Web (LlamaWeb), a WebGPU backend for llama.cpp that enables memory-efficient and performance-portable LLM inference in the browser across a wide range of model weight formats. Our design significantly reduces memory overhead through static memory planning and efficient model loading, addresses cross-device variability through a tunable kernel library, and introduces templated GPU kernels that support performant implementations of numerous quantization formats, enabling broad model support and extensibility to new formats. We evaluate LlamaWeb on 16 devices from 8 vendors, collecting data from 10 language models and 4 model weight formats. Compared with existing browser-based LLM frameworks, LlamaWeb requires 29-33% less memory across several combinations of device, browser, and operating system, and increases decode throughput by 45-69% across four GPUs from separate vendors. We also compare LlamaWeb against other llama.cpp backends, where it is competitive with vendor-specific backends and on some devices outperforms them.