Large Language Models (LLMs) face an emerging and critical threat: latency attacks. Because LLM inference is inherently expensive, even modest slowdowns translate into substantial operating costs and severe availability risks. A growing body of research has focused on algorithmic complexity attacks that craft inputs to trigger worst-case output lengths. However, we report a counter-intuitive finding: these algorithmic latency attacks are largely ineffective against modern LLM serving systems. We show that system-level optimizations such as continuous batching provide a form of logical isolation that confines a long-running request's latency impact and prevents it from spreading to co-located users. We therefore shift the focus from the algorithm layer to the system layer and introduce Fill-and-Squeeze, a new attack strategy targeting the scheduler's state transitions: "Fill" first exhausts the global KV cache to induce head-of-line blocking, while "Squeeze" forces the system into repeated preemption. By manipulating output lengths with methods ranging from simple plain-text prompts to more elaborate prompt engineering, and by probing memory status through a side channel, we demonstrate that the attack can be orchestrated in a black-box setting at much lower cost. Extensive evaluations show up to a 20-280x average slowdown in Time to First Token and a 1.5-4x average slowdown in Time Per Output Token relative to existing attacks, at 30-40% lower attack cost.
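The Fill-and-Squeeze mechanism described above can be illustrated with a toy model of a continuous-batching scheduler. The sketch below is not any real serving system's scheduler; all names, block counts, and policies (e.g. preempting the newest request on KV-cache overflow) are illustrative assumptions. It shows how attacker requests that occupy and keep growing the shared KV cache force a later victim request to wait ("Fill" → head-of-line blocking) and trigger repeated evictions ("Squeeze" → preemption).

```python
# Toy continuous-batching scheduler with a shared KV-cache budget.
# "Fill": arrivals are admitted only while KV blocks remain, so a full cache
# queues later requests. "Squeeze": when decoding grows sequences past the
# budget, the newest running request is preempted and must re-prefill.
from collections import deque

KV_BUDGET = 100  # total KV-cache blocks shared by all requests (toy number)

class Request:
    def __init__(self, rid, prompt_blocks, max_new):
        self.rid = rid
        self.prompt = prompt_blocks  # KV blocks needed by the prompt
        self.blocks = prompt_blocks  # KV blocks currently held
        self.remaining = max_new     # decode steps left

def simulate(arrivals, steps):
    waiting = deque(arrivals)
    running, used, preemptions = [], 0, 0
    timeline = []  # (running count, waiting count) per step
    for _ in range(steps):
        # Admit waiting requests in order while the KV budget allows.
        while waiting and used + waiting[0].blocks <= KV_BUDGET:
            req = waiting.popleft()
            running.append(req)
            used += req.blocks
        # One decode step: every running request grows by one KV block.
        for req in list(running):
            req.blocks += 1
            used += 1
            req.remaining -= 1
            if req.remaining == 0:       # finished: release its KV blocks
                running.remove(req)
                used -= req.blocks
        # On overflow, preempt the newest request; its KV cache is evicted
        # and it must re-prefill from its prompt when re-admitted.
        while used > KV_BUDGET and running:
            victim = running.pop()
            used -= victim.blocks
            victim.blocks = victim.prompt
            waiting.appendleft(victim)
            preemptions += 1
        timeline.append((len(running), len(waiting)))
    return preemptions, timeline
```

Running this with four long-output "filler" requests that nearly exhaust the budget, followed by a short victim request, shows the victim stuck in the waiting queue while the scheduler repeatedly preempts and re-admits the fillers, which is the contagious-latency effect the attack exploits.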