During inference for transformer-based large language models (LLMs), prefilling is the computation of the key-value (KV) cache for the input tokens in the prompt prior to autoregressive generation. For longer input prompt lengths, prefilling incurs a significant overhead on decoding time. In this work, we highlight the following pitfall of prefilling: for batches containing prompts of highly varying lengths, significant computation is wasted by the standard practice of padding sequences to the maximum length. As LLMs increasingly support longer context lengths, potentially up to 10 million tokens, variations in prompt lengths within a batch become more pronounced. To address this, we propose Prepacking, a simple yet effective method to optimize prefilling computation. To avoid redundant computation on pad tokens, prepacking combines prompts of varying lengths into a single sequence and packs multiple such sequences into a compact batch using a bin-packing algorithm. It then modifies the attention mask and positional encoding to compute multiple prefilled KV caches for multiple prompts within a single sequence. On standard curated datasets containing prompts of varying lengths, we achieve significant speed and memory efficiency improvements compared to the default padding-based prefilling computation in Huggingface, across a range of base model configurations and inference serving scenarios.
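To make the two core steps concrete, below is a minimal sketch in PyTorch, not the authors' implementation: a greedy first-fit-decreasing bin-packing heuristic that combines variable-length prompts into packed sequences, followed by construction of the per-sequence inputs the abstract describes, namely a block-diagonal causal attention mask so that packed prompts cannot attend to one another, and position ids that restart from zero at each prompt boundary. All function names and the `pad_id` parameter are illustrative assumptions.

```python
import torch

def prepack(prompts, max_len):
    """Greedy first-fit-decreasing bin packing (a sketch; the paper's
    exact packing heuristic may differ): group variable-length prompts
    into bins whose total token count is at most max_len."""
    bins = []  # each bin is a list of prompts (lists of token ids)
    for p in sorted(prompts, key=len, reverse=True):
        for b in bins:
            if sum(len(q) for q in b) + len(p) <= max_len:
                b.append(p)
                break
        else:
            bins.append([p])
    return bins

def build_packed_inputs(bin_, max_len, pad_id=0):
    """For one packed sequence, return concatenated token ids, position
    ids that restart at each prompt boundary, and a block-diagonal
    causal mask so prompts in the same sequence stay independent."""
    ids = torch.full((max_len,), pad_id, dtype=torch.long)
    pos = torch.zeros(max_len, dtype=torch.long)
    mask = torch.zeros(max_len, max_len, dtype=torch.bool)
    offset = 0
    for p in bin_:
        n = len(p)
        ids[offset:offset + n] = torch.tensor(p, dtype=torch.long)
        pos[offset:offset + n] = torch.arange(n)
        # causal attention restricted to this prompt's own block
        mask[offset:offset + n, offset:offset + n] = torch.tril(
            torch.ones(n, n, dtype=torch.bool)
        )
        offset += n
    return ids, pos, mask
```

With inputs packed this way, a single forward pass over one packed sequence yields the prefilled KV cache entries for every prompt it contains; compared to padding each prompt to the batch maximum, no compute is spent attending to or from pad positions.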