Privacy-Aware Semantic Cache for Large Language Models

Large Language Models (LLMs) such as ChatGPT, Google Bard, Claude, and Llama 2 have revolutionized natural language processing and search engine dynamics. However, these models incur exceptionally high computational costs. For instance, GPT-3 consists of 175 billion parameters, and each inference on such a model requires billions of floating-point operations. Caching is a natural way to reduce LLM inference costs on repeated queries. However, existing caching methods cannot detect semantic similarity among LLM queries, which leads to unacceptably high false-hit and false-miss rates. This paper introduces MeanCache, a semantic cache for LLMs that identifies semantically similar queries to decide between a cache hit and a cache miss. With MeanCache, the response to a query that is semantically similar to one the user asked earlier can be retrieved from a local cache rather than by re-querying the LLM, thus reducing costs, service provider load, and environmental impact. MeanCache leverages Federated Learning (FL) to collaboratively train a query similarity model in a distributed manner across numerous users without violating their privacy. By placing a local cache on each user's device and using FL, MeanCache reduces latency and costs and improves the performance of the similarity model, resulting in lower cache false-hit rates. Our experiments, benchmarked against GPTCache, show that MeanCache attains an approximately 17% higher F-score and a 20% increase in precision in semantic cache hit-and-miss decisions. Furthermore, MeanCache reduces the storage requirement by 83% and accelerates semantic cache hit-and-miss decisions by 11%, while still outperforming GPTCache.
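
The hit-or-miss decision described above can be illustrated with a minimal sketch. It is not MeanCache's implementation: the names `SemanticCache`, `toy_embed`, and `answer`, the cosine-similarity test, and the threshold of 0.8 are all assumptions made for illustration, and the bag-of-words embedding merely stands in for the trained query similarity model (it captures word overlap, not meaning).

```python
import math
import re
from collections import Counter
from typing import Callable, Dict, List, Optional, Tuple

Vector = Dict[str, float]


def toy_embed(text: str) -> Vector:
    """Hypothetical stand-in embedding: an L2-normalized bag of words.

    A real semantic cache would use a trained sentence encoder here.
    """
    counts = Counter(re.findall(r"[a-z0-9]+", text.lower()))
    norm = math.sqrt(sum(c * c for c in counts.values()))
    if norm == 0:
        return {}
    return {word: c / norm for word, c in counts.items()}


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity of two already-normalized sparse vectors."""
    return sum(value * b.get(word, 0.0) for word, value in a.items())


class SemanticCache:
    """Local cache that returns a stored response for a similar enough query."""

    def __init__(self, embed: Callable[[str], Vector] = toy_embed,
                 threshold: float = 0.8) -> None:
        self.embed = embed
        self.threshold = threshold  # assumed value; would be tuned in practice
        self._entries: List[Tuple[Vector, str]] = []

    def lookup(self, query: str) -> Optional[str]:
        """Return the best-matching cached response, or None on a miss."""
        q = self.embed(query)
        best_score, best_response = 0.0, None
        for vector, response in self._entries:
            score = cosine(q, vector)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None

    def store(self, query: str, response: str) -> None:
        self._entries.append((self.embed(query), response))


def answer(query: str, cache: SemanticCache, llm: Callable[[str], str]) -> str:
    """Serve from the local cache on a hit; otherwise call the LLM and cache."""
    cached = cache.lookup(query)
    if cached is not None:
        return cached
    response = llm(query)
    cache.store(query, response)
    return response
```

In this sketch, a threshold that is too low produces false hits (a cached answer returned for an unrelated query), while one that is too high produces false misses (an avoidable LLM call), which is the trade-off that the precision and F-score figures above quantify.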
