Global KV-cache sharing is an effective optimization for accelerating large language model (LLM) inference, yet it introduces an API-visible timing side channel that lets adversaries infer sensitive user inputs from shared entries, creating cross-tenant privacy risks. To address this problem, we introduce SafeKV (Secure and Flexible KV-cache Sharing), a system-level co-design of privacy enforcement and KV-cache management. SafeKV integrates lightweight detection and isolation directly into the serving runtime to eliminate cross-tenant reuse of sensitive KV-cache blocks under our threat model, while recovering most of the performance benefits of global sharing. Our key contributions are: (1) a three-tier asynchronous detection pipeline that decouples privacy classification from inference and supports streaming workloads, (2) a unified radix-tree-based memory manager with path compression and sensitivity-aware eviction for scalable selective isolation, and (3) a Reuse Diversity Ratio (RDR)-guided runtime safeguard that detects and bounds residual leakage. On large LLM backends, SafeKV reduces time-to-first-token (TTFT) overhead by up to 40.58% relative to full isolation and raises throughput by up to 2.66x. Overall, SafeKV restores the efficiency of KV reuse while enforcing strong, practical privacy for multi-tenant LLM inference.
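The selective-isolation idea behind contribution (2) can be illustrated with a minimal sketch: a token-prefix tree in which each cached block carries a sensitivity flag, and lookups stop at the first sensitive block owned by another tenant. All names here are hypothetical; SafeKV's actual manager is a compressed radix tree with sensitivity-aware eviction, which this sketch deliberately omits.

```python
class _Node:
    """One cached KV block in a (non-compressed) prefix tree."""
    def __init__(self):
        self.children = {}     # next token -> _Node
        self.owner = None      # tenant that inserted this block
        self.sensitive = False # sensitive blocks are never shared across tenants


class PrefixCache:
    """Toy prefix cache with per-block selective isolation (illustrative only)."""

    def __init__(self):
        self.root = _Node()

    def insert(self, tokens, tenant, sensitive=False):
        """Cache the KV blocks for a token prefix on behalf of `tenant`."""
        node = self.root
        for tok in tokens:
            node = node.children.setdefault(tok, _Node())
            node.owner = tenant
            node.sensitive = node.sensitive or sensitive

    def match(self, tokens, tenant):
        """Length of the longest cached prefix `tenant` may reuse.

        A sensitive block is reusable only by its owner, so cross-tenant
        lookups cannot observe cache hits on another tenant's private data.
        """
        node, hit = self.root, 0
        for tok in tokens:
            child = node.children.get(tok)
            if child is None:
                break
            if child.sensitive and child.owner != tenant:
                break  # block cross-tenant reuse of a sensitive block
            node, hit = child, hit + 1
        return hit
```

Under this policy a second tenant probing another tenant's sensitive prompt gets a cache miss (no timing signal), while non-sensitive prefixes remain globally shareable, which is the efficiency/privacy trade-off the abstract describes.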