Cortex: Achieving Low-Latency, Cost-Efficient Remote Data Access For LLM via Semantic-Aware Knowledge Caching

Large Language Model (LLM) agents tackle data-intensive tasks such as deep research and code generation. However, their effectiveness depends on frequent interactions with knowledge sources located in remote clouds or regions, and these interactions can create non-trivial latency and cost bottlenecks. Existing caching solutions focus on exact-match queries, which limits their effectiveness for semantic knowledge reuse. To address this challenge, we introduce Cortex, a novel cross-region knowledge caching architecture for LLM agents. At its core are two abstractions: the Semantic Element (SE) and the Semantic Retrieval Index (Seri). A semantic element captures the semantic embedding of an LLM query together with performance-aware metadata such as latency, cost, and staticity. Seri then provides two-stage retrieval: a vector similarity index over semantic embeddings for fast candidate selection, followed by a lightweight LLM-powered semantic judger for precise validation. Atop these primitives, Cortex builds a new cache interface that includes a semantic-aware cache hit definition, a cost-efficient eviction policy, and proactive prefetching. To reduce overhead, Cortex co-locates the small LLM judger with the main LLM using adaptive scheduling and resource sharing. Our evaluation demonstrates that Cortex delivers substantial performance improvements without compromising correctness. On representative search workloads, Cortex achieves up to a 3.6x increase in throughput by maintaining cache hit rates above 85%, while preserving accuracy virtually identical to that of non-cached baselines. Cortex also improves throughput for coding tasks by 20%, showcasing its versatility across diverse agentic workloads.
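The two-stage retrieval described above can be illustrated with a minimal sketch. This is not Cortex's implementation or API. Every name here (`SemanticElement` and its fields, `lookup`, `toy_judge`, `top_k`, `min_sim`) is hypothetical. The brute-force cosine scan stands in for a real vector similarity index, and a keyword check stands in for the small LLM judger.

```python
import math
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence


@dataclass
class SemanticElement:
    """Hypothetical cache entry: query embedding plus performance-aware metadata."""
    query: str
    embedding: Sequence[float]
    result: str
    latency_ms: float  # assumed: observed latency of the original remote fetch
    cost_usd: float    # assumed: observed cost of the original remote fetch
    staticity: float   # assumed scale 0..1: how stable the answer is over time


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def lookup(
    query: str,
    query_emb: Sequence[float],
    cache: List[SemanticElement],
    judge: Callable[[str, str], bool],
    top_k: int = 3,
    min_sim: float = 0.8,
) -> Optional[SemanticElement]:
    """Return a cached element only if it passes both stages, else None."""
    # Stage 1: fast candidate selection by embedding similarity.
    # A linear scan is used here in place of a vector index.
    scored = sorted(
        ((cosine(query_emb, se.embedding), se) for se in cache),
        key=lambda pair: pair[0],
        reverse=True,
    )
    candidates = [se for sim, se in scored[:top_k] if sim >= min_sim]

    # Stage 2: precise validation. `judge` stands in for the LLM judger and
    # decides whether the cached query really answers the new query.
    for se in candidates:
        if judge(query, se.query):
            return se
    return None


def toy_judge(new_query: str, cached_query: str) -> bool:
    """Placeholder judger: a keyword check instead of an LLM call."""
    return "capital" in new_query and "capital" in cached_query


cache = [
    SemanticElement("capital of France", [1.0, 0.0], "Paris", 120.0, 0.002, 0.9),
    SemanticElement("weather in Paris today", [0.0, 1.0], "sunny", 150.0, 0.002, 0.1),
]

hit = lookup("what is the capital city of France", [0.9, 0.1], cache, toy_judge)
miss = lookup("France travel tips", [1.0, 0.0], cache, toy_judge)
```

In this sketch the second lookup passes the similarity stage but is rejected by the judger, which is the case a purely embedding-based cache would get wrong. The metadata fields are carried along unused; an eviction or prefetching policy could consult them, but the abstract does not specify how.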
