As large language models (LLMs) are increasingly deployed for long-content generation, demand for efficient long-sequence inference has grown. However, the key-value (KV) cache, which is stored to avoid re-computation, has become a critical bottleneck because it grows linearly with sequence length. Due to the auto-regressive nature of LLMs, the entire KV cache must be loaded for every generated token, resulting in low utilization of computational cores and high latency. While various KV cache compression methods have been proposed to alleviate this issue, they degrade generation quality. We introduce TriForce, a hierarchical speculative decoding system that scales to long-sequence generation. TriForce uses the original model weights together with a dynamic sparse KV cache, obtained via retrieval, as a draft model. This draft model serves as an intermediate layer in the hierarchy and is itself speculated by a smaller model to reduce its drafting latency. TriForce achieves speedups of up to 2.31$\times$ for Llama2-7B-128K on an A100 GPU and also scales to even longer contexts. In the offloading setting on two RTX 4090 GPUs, TriForce achieves 0.108 s/token, only half the speed of the auto-regressive baseline on an A100, and a 7.78$\times$ speedup on our optimized offloading system. Additionally, TriForce is 4.86$\times$ faster than DeepSpeed-Zero-Inference on a single RTX 4090 GPU. TriForce is also robust, performing consistently well across sampling temperatures. The code is available at https://github.com/Infini-AI-Lab/TriForce.
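To make the linear growth of the KV cache concrete, the following is a minimal back-of-the-envelope sketch, not part of TriForce itself. It assumes the publicly documented Llama2-7B shape (32 layers, 32 attention heads, head dimension 128), fp16 storage (2 bytes per element), and a batch size of 1. The function name `kv_cache_bytes` is hypothetical and chosen only for illustration.

```python
# Rough KV cache size estimate for a decoder-only transformer.
# Assumptions (not taken from the abstract): Llama2-7B shape with
# 32 layers, 32 KV heads, head_dim 128, fp16 storage, batch size 1.

GIB = 1024 ** 3


def kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=32, head_dim=128,
                   bytes_per_elem=2):
    """Bytes needed to hold keys and values for one sequence."""
    # Leading factor 2: one tensor for keys and one for values per layer.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * seq_len


if __name__ == "__main__":
    for seq_len in (4096, 32768, 131072):
        size_gib = kv_cache_bytes(seq_len) / GIB
        print(f"{seq_len:>7} tokens: {size_gib:.1f} GiB")
```

Under these assumptions each token adds 0.5 MiB, so a 128K-token context needs 64 GiB of KV cache, several times the roughly 13 GiB occupied by the fp16 weights of a 7B-parameter model. Because auto-regressive decoding reads the whole cache for every new token, this memory traffic, rather than arithmetic, tends to dominate per-token latency at long context lengths.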