Advancements in large language models (LLMs) have unlocked remarkable capabilities. While deploying these models typically requires server-grade GPUs and cloud-based inference, the recent emergence of smaller open-source models and increasingly powerful consumer devices has made on-device deployment practical. The web browser as a platform for on-device deployment is universally accessible, provides a natural agentic environment, and conveniently abstracts away the heterogeneous backends of diverse device vendors. To address this opportunity, we introduce WebLLM, an open-source JavaScript framework that enables high-performance LLM inference entirely within web browsers. WebLLM provides an OpenAI-style API for seamless integration into web applications, and leverages WebGPU for efficient local GPU acceleration and WebAssembly for performant CPU computation. Using the machine learning compilers MLC-LLM and Apache TVM, WebLLM generates optimized WebGPU kernels, overcoming the absence of performant WebGPU kernel libraries. Evaluations show that WebLLM can retain up to 80% of native performance on the same device, with room to further close the gap. WebLLM paves the way for universally accessible, privacy-preserving, personalized, and locally powered LLM applications in web browsers. The code is available at: https://github.com/mlc-ai/web-llm.
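To illustrate the OpenAI-style API mentioned above, the following is a minimal sketch of how an application might use WebLLM; the model ID shown is illustrative, and the code runs only inside a WebGPU-capable browser (it cannot execute in a plain Node.js environment).

```typescript
import { CreateMLCEngine } from "@mlc-ai/web-llm";

// Download model weights and compile WebGPU kernels in the browser.
// The model ID below is an example; available IDs are listed in the
// WebLLM repository's prebuilt model list.
const engine = await CreateMLCEngine("Llama-3.1-8B-Instruct-q4f16_1-MLC");

// The request shape mirrors the OpenAI chat completions API.
const reply = await engine.chat.completions.create({
  messages: [
    { role: "system", content: "You are a helpful assistant." },
    { role: "user", content: "Explain WebGPU in one sentence." },
  ],
});

console.log(reply.choices[0].message.content);
```

Because the request and response shapes mirror the OpenAI API, existing web applications written against cloud endpoints can switch to fully local, in-browser inference with minimal code changes.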