Large Language Models (LLMs) are revolutionizing how users interact with information systems, yet their high inference cost poses serious scalability and sustainability challenges. Caching inference responses, allowing them to be retrieved without another forward pass through the LLM, has emerged as one possible solution. Traditional exact-match caching, however, overlooks the semantic similarity between queries, leading to unnecessary recomputation. Semantic caching addresses this by retrieving responses based on semantic similarity, but it introduces a fundamentally different cache eviction problem: one must account for mismatch costs between incoming queries and cached responses. Moreover, key system parameters, such as query arrival probabilities and serving costs, are often unknown and must be learned over time. Existing semantic caching methods are largely ad hoc, lacking theoretical foundations and unable to adapt to real-world uncertainty. In this paper, we present a principled, learning-based framework for semantic cache eviction under unknown query and cost distributions. We formulate both offline optimization and online learning variants of the problem, and we develop provably efficient algorithms with state-of-the-art guarantees. We also evaluate our framework on a synthetic dataset, showing that our proposed algorithms achieve performance matching or exceeding that of baselines.