Large language models (LLMs) now sit on the critical path of search, assistance, and agentic workflows, making semantic caching essential for reducing inference cost and latency. Production deployments typically use a tiered static-dynamic design: a static cache of curated, offline-vetted responses mined from logs, backed by a dynamic cache populated online. In practice, both tiers are commonly governed by a single embedding-similarity threshold, which forces a hard tradeoff: conservative thresholds miss safe reuse opportunities, while aggressive thresholds risk serving semantically incorrect responses. We introduce \textbf{Krites}, an asynchronous, LLM-judged caching policy that expands static coverage without changing serving decisions. On the critical path, Krites behaves exactly like a standard static-threshold policy. When a prompt's nearest static neighbor falls just below the static threshold, Krites asynchronously invokes an LLM judge to verify whether the static response is acceptable for the new prompt. Approved matches are promoted into the dynamic cache, so future repeats and paraphrases can reuse curated static answers, expanding static reach over time. In trace-driven simulations on conversational and search workloads, Krites increases the fraction of requests served with curated static answers (direct static hits plus verified promotions) by up to $\textbf{3.9}\times$ relative to tuned baselines, with critical-path latency unchanged.
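The policy described above can be sketched in a few lines; this is a minimal illustration, not the paper's implementation, and the class name, thresholds, margin band, cosine similarity, and the judge/embedding stubs are all assumptions introduced here:

```python
import threading


def cosine(a, b):
    """Cosine similarity between two embedding vectors (illustrative metric)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0


class Krites:
    """Sketch of the Krites policy: standard static-threshold serving on the
    critical path, plus asynchronous LLM-judge verification for near-misses."""

    def __init__(self, static_cache, tau_static=0.90, margin=0.05, judge=None):
        self.static = static_cache   # list of (embedding, curated response)
        self.dynamic = {}            # prompt -> promoted static response
        self.tau = tau_static        # serving threshold (unchanged by Krites)
        self.margin = margin         # width of the "just below" band
        self.judge = judge or (lambda prompt, response: False)

    def serve(self, prompt, emb):
        # Critical path: identical to a plain static-threshold policy,
        # with an exact-match lookup in the dynamic tier first.
        if prompt in self.dynamic:
            return self.dynamic[prompt]
        best_sim, best_resp = max(
            ((cosine(emb, e), r) for e, r in self.static),
            default=(0.0, None),
        )
        if best_sim >= self.tau:
            return best_resp         # direct static hit
        if best_sim >= self.tau - self.margin:
            # Off the critical path: judge the near-miss asynchronously.
            threading.Thread(
                target=self._verify, args=(prompt, best_resp), daemon=True
            ).start()
        return None                  # cache miss: fall through to the LLM

    def _verify(self, prompt, response):
        # Promote only judge-approved matches into the dynamic cache, so
        # future repeats and paraphrases reuse the curated static answer.
        if self.judge(prompt, response):
            self.dynamic[prompt] = response
```

Note that the serving decision itself never waits on the judge: a request either clears the static threshold or misses, exactly as in the baseline policy, and verification only affects how later requests are served.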