Emerging large language model (LLM) applications involve diverse reasoning strategies and agentic workflows, straining the capabilities of existing serving systems built around a monolithic token generation loop. This paper introduces Pie, a programmable LLM serving system designed for flexibility and efficiency. Pie decomposes the traditional generation loop into fine-grained service handlers exposed via an API and delegates control of the generation process to user-provided programs, called inferlets. This enables applications to implement new KV cache strategies and bespoke generation logic, and to seamlessly integrate computation and I/O, entirely within the application and without modifications to the serving system. Pie executes inferlets using WebAssembly, benefiting from its lightweight sandboxing. Our evaluation shows that Pie matches state-of-the-art performance on standard tasks (3-12% latency overhead) while significantly improving latency and throughput (1.3x-3.4x higher) on agentic workflows by enabling application-specific optimizations.
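To make the programming model concrete, below is a minimal sketch of what an inferlet's application-driven generation loop might look like. This is an illustrative assumption, not Pie's actual interface: the module name pie and the functions create_context, fill, decode_one, free_kv, and detokenize are hypothetical stand-ins for the fine-grained service handlers described above.

```python
# Hypothetical sketch: the inferlet, not the serving system, drives the
# token generation loop through fine-grained service handlers.
# All names below (pie, create_context, fill, decode_one, free_kv,
# detokenize, EOS) are illustrative assumptions, not Pie's real API.
import pie  # hypothetical client module available inside the Wasm sandbox


def run(prompt: str, max_tokens: int = 128) -> str:
    ctx = pie.create_context()       # allocate KV cache state for this request
    pie.fill(ctx, prompt)            # prefill: process the prompt tokens
    output = []
    for _ in range(max_tokens):
        token = pie.decode_one(ctx)  # request a single decode step
        if token == pie.EOS:
            break
        output.append(token)
        # Application-specific logic (tool calls, external I/O, custom KV
        # cache edits) can be interleaved here without server changes.
    pie.free_kv(ctx)                 # release KV cache resources explicitly
    return pie.detokenize(output)
```

Because the loop runs inside the application, an agentic workflow could, for example, pause decoding to issue a tool call and then resume with the same KV cache, rather than round-tripping through a monolithic server loop.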