Many applications must provide low-latency LLM service to users or risk an unacceptable user experience. However, over-provisioning resources to serve fluctuating request patterns is often prohibitively expensive. In this work, we present a best-effort serving system that employs deep reinforcement learning to adjust service quality based on the task distribution and system load. On unpredictable workloads, our best-effort system maintains availability at over 10x higher client request rates, serves above 96% of peak performance 4.1x more often, and serves above 98% of peak performance 2.3x more often than static serving. Our learned router is robust to shifts in both the arrival and task distributions. Compared to static serving, learned best-effort serving enables cost-efficient serving through increased hardware utilization. Additionally, we argue that learned best-effort LLM serving is applicable in a wide variety of settings and gives application developers great flexibility to meet their specific needs.
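To make the idea of a learned, load-aware router concrete, the following is a minimal sketch and not the system described above. It swaps deep reinforcement learning for a tabular epsilon-greedy bandit so that it stays short and runnable. The model tiers, their quality and capacity values, the load discretization, and the reward function are all hypothetical assumptions chosen for illustration; the abstract does not specify any of them.

```python
import random

# Hypothetical model tiers: (name, quality if served on time, max load it tolerates).
# These numbers are illustrative assumptions, not values from the paper.
MODELS = [("small", 0.90, 1.0), ("large", 1.00, 0.4)]
LOAD_BUCKETS = 5  # system load in [0, 1) is discretized into this many states


def bucket(load):
    """Map a load in [0, 1] to a discrete state index."""
    return min(int(load * LOAD_BUCKETS), LOAD_BUCKETS - 1)


def simulated_reward(model_idx, load):
    """Toy environment: a model returns its quality if the load is within
    its capacity, and 0 otherwise (standing in for a timeout or drop)."""
    _, quality, capacity = MODELS[model_idx]
    return quality if load <= capacity else 0.0


class BanditRouter:
    """Tabular epsilon-greedy router: one value estimate per (load bucket, model)."""

    def __init__(self, epsilon=0.2, lr=0.1, seed=0):
        self.q = [[0.0] * len(MODELS) for _ in range(LOAD_BUCKETS)]
        self.epsilon = epsilon
        self.lr = lr
        self.rng = random.Random(seed)

    def route(self, load, explore=False):
        """Return the index of the model to serve a request at this load."""
        s = bucket(load)
        if explore and self.rng.random() < self.epsilon:
            return self.rng.randrange(len(MODELS))
        return max(range(len(MODELS)), key=lambda a: self.q[s][a])

    def update(self, load, action, reward):
        """Move the value estimate toward the observed reward."""
        s = bucket(load)
        self.q[s][action] += self.lr * (reward - self.q[s][action])


def train(router, steps=20000):
    """Train against the toy environment with uniformly random load."""
    for _ in range(steps):
        load = router.rng.random()
        action = router.route(load, explore=True)
        router.update(load, action, simulated_reward(action, load))
    return router


if __name__ == "__main__":
    router = train(BanditRouter())
    for load in (0.1, 0.5, 0.9):
        print(load, MODELS[router.route(load)][0])
```

In this toy setting the router learns to send requests to the higher-quality tier when load is low and to fall back to the cheaper tier when load is high, which is the best-effort behavior in miniature. A deep RL router would replace the lookup table with a neural policy and could condition on richer state, such as task type and queue lengths.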