Large Language Models (LLMs) like ChatGPT and Llama have revolutionized natural language processing and search engine dynamics. However, these models incur exceptionally high computational costs. For instance, GPT-3 consists of 175 billion parameters, and a single inference demands billions of floating-point operations. Caching is a natural solution to reduce LLM inference costs on repeated queries, which constitute about 31% of the total queries. However, existing caching methods can neither identify semantic similarities among LLM queries nor handle contextual queries, leading to unacceptable false hit and miss rates. This paper introduces MeanCache, a user-centric semantic cache for LLM-based services that identifies semantically similar queries to determine cache hit or miss. Using MeanCache, the response to a user's semantically similar query can be retrieved from a local cache rather than re-querying the LLM, thus reducing costs, service provider load, and environmental impact. MeanCache leverages Federated Learning (FL) to collaboratively train a query similarity model without violating user privacy. By placing a local cache in each user's device and using FL, MeanCache reduces latency and costs and enhances model performance, resulting in lower false hit rates. MeanCache also encodes context chains for every cached query, offering a simple yet highly effective mechanism to discern contextual query responses from standalone ones. Our experiments, benchmarked against the state-of-the-art caching method, reveal that MeanCache attains an approximately 17% higher F-score and a 20% increase in precision for semantic cache hit and miss decisions, while performing even better on contextual queries. It also reduces the storage requirement by 83% and accelerates semantic cache hit and miss decisions by 11%.
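To make the core idea concrete, the following is a minimal sketch of a semantic cache: queries are embedded into vectors, and a lookup is a hit when the best cosine similarity against cached entries exceeds a threshold. This is an illustrative toy, not MeanCache's actual model or API; the `SemanticCache` class, the `embed` callable, and the `threshold` value are all assumptions introduced here for exposition.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

class SemanticCache:
    """Toy semantic cache: stores (embedding, response) pairs and serves a
    cached response when a new query's embedding is close enough."""

    def __init__(self, embed, threshold=0.9):
        self.embed = embed          # callable: query text -> embedding vector
        self.threshold = threshold  # minimum similarity to count as a hit
        self.entries = []           # list of (embedding, response) pairs

    def lookup(self, query):
        """Return the cached response on a semantic hit, else None (miss)."""
        q = self.embed(query)
        best_resp, best_sim = None, -1.0
        for emb, resp in self.entries:
            sim = cosine(q, emb)
            if sim > best_sim:
                best_resp, best_sim = resp, sim
        return best_resp if best_sim >= self.threshold else None

    def store(self, query, response):
        """Cache the LLM's response under the query's embedding."""
        self.entries.append((self.embed(query), response))
```

In practice the embedding would come from a trained similarity model (in MeanCache's setting, one trained via federated learning) rather than the hand-built vectors used below, and nearest-neighbor search would replace the linear scan.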