We present Prompt Cache, an approach for accelerating inference for large language models (LLMs) by reusing attention states across different LLM prompts. Many input prompts have overlapping text segments, such as system messages, prompt templates, and documents provided for context. Our key insight is that by precomputing and storing the attention states of these frequently occurring text segments on the inference server, we can efficiently reuse them when these segments appear in user prompts. Prompt Cache employs a schema to explicitly define such reusable text segments, called prompt modules. The schema ensures positional accuracy during attention state reuse and provides users with an interface to access cached states in their prompts. Using a prototype implementation, we evaluate Prompt Cache across several LLMs. We show that Prompt Cache significantly reduces time-to-first-token latency, especially for longer prompts such as document-based question answering and recommendations. The improvements range from 8x for GPU-based inference to 60x for CPU-based inference, all while maintaining output accuracy and without the need for model parameter modifications.
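The reuse idea can be illustrated with a toy sketch. It is not the prototype implementation: the `encode` function, the `PromptModuleCache` class, and the schema layout are all hypothetical names chosen for illustration, and a tuple of (position, token) stands in for the real key/value attention tensors. The sketch assumes each prompt module is assigned a fixed position range by the schema, so its states can be computed once and reused in any prompt that imports the module.

```python
from dataclasses import dataclass, field


def encode(tokens, start_pos):
    """Toy stand-in for computing attention states at fixed positions.

    A real model would produce key/value tensors per layer; here each
    state is just a (position, token) pair so the sketch stays runnable.
    """
    return [(start_pos + i, tok) for i, tok in enumerate(tokens)]


@dataclass
class PromptModuleCache:
    # Hypothetical schema: module name -> (start position, token list).
    schema: dict
    states: dict = field(default_factory=dict)
    encoded_tokens: int = 0  # counts how many tokens were actually encoded

    def precompute(self):
        """Encode every prompt module once, at its schema-assigned position."""
        for name, (start, tokens) in self.schema.items():
            self.states[name] = encode(tokens, start)
            self.encoded_tokens += len(tokens)

    def serve(self, module_names, user_tokens):
        """Assemble states for a prompt from cached modules plus new text."""
        assembled = []
        for name in module_names:
            assembled.extend(self.states[name])  # reused, not recomputed
        # New text is placed after the positions reserved by the schema.
        next_pos = max(start + len(toks) for start, toks in self.schema.values())
        assembled.extend(encode(user_tokens, next_pos))
        self.encoded_tokens += len(user_tokens)
        return assembled


# Usage: two modules are encoded once; each request only encodes its question.
cache = PromptModuleCache(schema={
    "system": (0, ["You", "are", "helpful"]),
    "doc": (3, ["Paris", "is", "in", "France"]),
})
cache.precompute()
first = cache.serve(["system", "doc"], ["Where", "is", "Paris", "?"])
second = cache.serve(["system", "doc"], ["Which", "country", "?"])
```

In this sketch the per-request cost grows only with the uncached user text, which is the intuition behind the time-to-first-token reduction for long, document-heavy prompts.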