Generating text with a large language model (LLM) consumes massive amounts of memory. Apart from the already-large model parameters, the key/value (KV) cache that holds information about previous tokens in a sequence can grow to be even larger than the model itself. This problem is exacerbated in current LLM serving frameworks that reserve KV cache memory for the maximum sequence length to guarantee that a complete sequence can be generated, since they do not know the output sequence length in advance. This restricts us to a smaller batch size, leading to lower GPU utilization and, above all, lower throughput. We argue that designing a system with a priori knowledge of the output sequence length can mitigate this problem. To this end, we propose S$^{3}$, which predicts the output sequence length, schedules generation queries based on the prediction to increase device resource utilization and throughput, and handles mispredictions. Our proposed method achieves 6.49$\times$ the throughput of systems that assume the worst case for the output sequence length.
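To make the memory argument concrete, the following is a minimal sketch of the arithmetic behind it, not the S$^{3}$ implementation. It assumes standard multi-head attention, where each layer caches one key vector and one value vector of the hidden size per token, stored in fp16. The names `ModelShape`, `kv_bytes_per_token`, `max_batch_size`, and `pack_by_predicted_length` are hypothetical, as are the model dimensions and memory budget in the usage note. The first-fit-decreasing packer is an assumed heuristic that only illustrates scheduling by predicted length.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ModelShape:
    """Hypothetical decoder-only model shape (illustrative values only)."""
    num_layers: int
    hidden_size: int
    bytes_per_value: int = 2  # assumes fp16 storage


def kv_bytes_per_token(shape: ModelShape) -> int:
    # One key vector and one value vector of hidden_size per layer.
    return 2 * shape.num_layers * shape.hidden_size * shape.bytes_per_value


def max_batch_size(kv_budget_bytes: int, reserved_tokens: int,
                   shape: ModelShape) -> int:
    # Sequences that fit when each one reserves `reserved_tokens` cache slots.
    return kv_budget_bytes // (reserved_tokens * kv_bytes_per_token(shape))


def pack_by_predicted_length(predicted_tokens: List[int],
                             token_budget: int) -> List[List[int]]:
    """Group request indices into batches whose predicted total length
    fits in `token_budget`, using first-fit decreasing (assumed heuristic)."""
    batches: List[List[int]] = []
    free: List[int] = []  # remaining token slots per batch
    order = sorted(range(len(predicted_tokens)),
                   key=lambda i: predicted_tokens[i], reverse=True)
    for i in order:
        need = predicted_tokens[i]
        if need > token_budget:
            raise ValueError(f"request {i} exceeds the token budget")
        for b, room in enumerate(free):
            if need <= room:
                batches[b].append(i)
                free[b] -= need
                break
        else:
            # No existing batch has room, so open a new one.
            batches.append([i])
            free.append(token_budget - need)
    return batches
```

Under these assumptions, a 32-layer model with hidden size 4096 needs 512 KiB of KV cache per token. With a 16 GiB cache budget, `max_batch_size` gives 16 sequences when each reserves a worst-case 2048 tokens, and 128 sequences when each reserves a predicted 256 tokens. This eightfold gap in batch size is illustrative only and is not a result reported by the paper.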