Large Language Models (LLMs) have revolutionized numerous domains, driving the rise of Language-Model-as-a-Service (LMaaS) platforms that process millions of queries daily. These platforms must minimize latency and meet Service Level Objectives (SLOs) while optimizing resource usage. However, conventional cloud service management techniques, designed for traditional workloads, are suboptimal for LMaaS due to its dynamic service workloads and variable request loads. To address this, we propose PreServe, a tailored LMaaS management framework centered on hierarchical prediction. PreServe incorporates a service workload predictor to estimate periodic token density at a coarse granularity and a novel request load predictor to assess the resource demand of individual LLM requests, enabling the construction of a load anticipator for each LLM instance. By integrating both long-term and short-term predictions, PreServe adjusts resource allocation in advance, mitigating the risks of instance under- or over-provisioning. Moreover, PreServe optimizes request routing by considering both current and anticipated future instance loads, ensuring balanced load distribution across instances. Evaluations on real-world production datasets show that PreServe outperforms state-of-the-art methods, reducing tail latency by 41.3% and resource consumption by 49.38% while incurring only 0.23% additional overhead.