Large language models (LLMs) have achieved huge success in numerous natural language processing (NLP) tasks. However, they face the challenge of significant resource consumption during inference. In this paper, we aim to improve the inference efficiency of LLMs by prompt caching, i.e., if the current prompt can be answered by the same response as a previous prompt, one can directly reuse that previous response without calling the LLM. Specifically, we focus on the prediction accuracy of prompt caching for single-round question-answering tasks via embedding similarity. Existing prompt embeddings mostly capture whether two prompts are semantically similar, which is not necessarily equivalent to whether the same response can answer both of them. Therefore, we propose a distillation-based method to fine-tune existing embeddings for better caching prediction. Theoretically, we provide finite-sample guarantees for the convergence of our method under different types of loss functions. Empirically, we carefully construct a hard dataset based on Kwiatkowski et al. (2019) on which the existing embedding model (Wang et al., 2022) achieves an AUC of only 0.51. We then fine-tune this embedding model, which significantly improves the AUC of caching prediction from 0.51 to 0.81. We also conduct simulations demonstrating that our trained models achieve better caching efficiency than the original embedding model.
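To make the caching decision concrete, the following is a minimal sketch of prompt caching via embedding similarity, not the implementation used in this paper. It assumes that prompt embeddings are already available as plain lists of floats (the embedding model is left out), and the names `PromptCache`, `cosine_similarity`, and the threshold value of 0.9 are hypothetical choices made only for illustration.

```python
import math
from typing import List, Optional, Tuple


def cosine_similarity(u: List[float], v: List[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


class PromptCache:
    """Hypothetical cache: reuse a stored response when embeddings are close enough."""

    def __init__(self, threshold: float = 0.9) -> None:
        # Assumed threshold; in practice it would be tuned on held-out data.
        self.threshold = threshold
        self.entries: List[Tuple[List[float], str]] = []

    def add(self, embedding: List[float], response: str) -> None:
        """Store the embedding of an answered prompt together with its response."""
        self.entries.append((embedding, response))

    def lookup(self, embedding: List[float]) -> Optional[str]:
        """Return the response of the most similar cached prompt, or None on a miss."""
        best_score = -1.0
        best_response: Optional[str] = None
        for cached_embedding, response in self.entries:
            score = cosine_similarity(embedding, cached_embedding)
            if score > best_score:
                best_score, best_response = score, response
        # A hit is declared only if the best match clears the threshold;
        # otherwise the caller would fall back to calling the LLM.
        return best_response if best_score >= self.threshold else None


cache = PromptCache(threshold=0.9)
cache.add([1.0, 0.0], "Paris")
hit = cache.lookup([0.99, 0.05])   # nearly parallel to the cached embedding
miss = cache.lookup([0.0, 1.0])    # orthogonal to the cached embedding
```

In this sketch the quality of the cache depends entirely on whether a high similarity score really means that the cached response answers the new prompt, which is the property the fine-tuned embeddings are meant to improve.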