Emerging large language model (LLM) applications involve diverse reasoning strategies and agentic workflows, straining the capabilities of existing serving systems built around a monolithic token generation loop. This paper introduces Pie, a programmable LLM serving system designed for flexibility and efficiency. Pie decomposes the traditional generation loop into fine-grained service handlers exposed via an API and delegates control of the generation process to user-provided programs, called inferlets. This enables applications to implement new KV cache strategies and bespoke generation logic, and to seamlessly integrate computation and I/O, entirely within the application and without modifications to the serving system. Pie executes inferlets using WebAssembly, benefiting from its lightweight sandboxing. Our evaluation shows that Pie matches state-of-the-art performance on standard tasks (3-12% latency overhead) while significantly improving latency and throughput (1.3x-3.4x higher) on agentic workflows by enabling application-specific optimizations.
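To make the programming model concrete, below is a minimal sketch of what an inferlet's application-driven generation loop might look like. This is an illustrative assumption, not Pie's actual interface: the module name pie and the functions create_context, fill, decode_one, free_kv, and detokenize are hypothetical stand-ins for the fine-grained service handlers described above.

```python
# Hypothetical sketch: the inferlet, not the serving system, drives the
# token generation loop through fine-grained service handlers.
# All names below (pie, create_context, fill, decode_one, free_kv,
# detokenize, EOS) are illustrative assumptions, not Pie's real API.
import pie  # hypothetical client module available inside the Wasm sandbox


def run(prompt: str, max_tokens: int = 128) -> str:
    ctx = pie.create_context()       # allocate KV cache state for this request
    pie.fill(ctx, prompt)            # prefill: process the prompt tokens
    output = []
    for _ in range(max_tokens):
        token = pie.decode_one(ctx)  # request a single decode step
        if token == pie.EOS:
            break
        output.append(token)
        # Application-specific logic (tool calls, external I/O, custom KV
        # cache edits) can be interleaved here without server changes.
    pie.free_kv(ctx)                 # release KV cache resources explicitly
    return pie.detokenize(output)
```

Because the loop runs inside the application, an agentic workflow could, for example, pause decoding to issue a tool call and then resume with the same KV cache, rather than round-tripping through a monolithic server loop.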