Prompt tuning has become a prominent strategy for enhancing the performance of Large Language Models (LLMs) on downstream tasks. Many IT enterprises now offer Prompt-Tuning-as-a-Service to meet the growing demand for prompt tuning LLMs on downstream tasks. Their primary objective is to satisfy users' Service Level Objectives (SLOs) while reducing resource provisioning costs. Nevertheless, our characterization analysis of existing deep learning resource management systems reveals that they are insufficient to optimize these objectives for LLM prompt tuning workloads. In this paper, we introduce PromptTuner, an SLO-aware elastic system for optimizing LLM prompt tuning. It contains two innovations. (1) We design a Prompt Bank that identifies efficient initial prompts to expedite the convergence of prompt tuning. (2) We develop a Workload Scheduler that enables fast resource allocation to reduce SLO violations and resource costs. In our evaluation, PromptTuner reduces SLO violations by 4.0x and 7.9x, and lowers costs by 1.6x and 4.5x, compared to INFless and ElasticFlow, respectively.