As large language models (LLMs) are increasingly deployed for long-content generation, demand for efficient long-sequence inference has grown. However, the key-value (KV) cache, which is stored to avoid re-computation, has become a critical bottleneck because its size grows linearly with sequence length. Due to the auto-regressive nature of LLMs, the entire KV cache must be loaded for every generated token, resulting in low utilization of computational cores and high latency. Various KV cache compression methods have been proposed to alleviate this issue, but they degrade generation quality. We introduce TriForce, a hierarchical speculative decoding system that scales to long-sequence generation. TriForce uses the original model weights together with a dynamic sparse KV cache, selected via retrieval, as a draft model. This draft model serves as an intermediate layer in the hierarchy and is itself speculated by a smaller model to reduce its drafting latency. TriForce achieves speedups of up to 2.31$\times$ for Llama2-7B-128K on an A100 GPU and scales to even longer contexts. In the offloading setting on two RTX 4090 GPUs, TriForce achieves 0.108s/token, only half as fast as the auto-regressive baseline on an A100, and attains a 7.78$\times$ speedup on our optimized offloading system. Additionally, TriForce runs 4.86$\times$ as fast as DeepSpeed-Zero-Inference on a single RTX 4090 GPU. TriForce is also robust, performing consistently well across various temperatures. The code is available at https://github.com/Infini-AI-Lab/TriForce.
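To make the linear growth of the KV cache concrete, the following is a minimal back-of-the-envelope sketch. The abstract does not state the model configuration; the code assumes the commonly published Llama2-7B settings (32 layers, 32 key/value heads, head dimension 128) and 16-bit storage, and the function name `kv_cache_bytes` is hypothetical.

```python
def kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=32, head_dim=128,
                   bytes_per_elem=2):
    """Estimate KV cache size in bytes for one sequence.

    Defaults are ASSUMED Llama2-7B values with fp16 storage; they are
    not taken from the abstract.
    """
    # Factor 2: one tensor for keys and one for values in every layer.
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    # The cache holds one entry per token, so size is linear in seq_len.
    return per_token * seq_len


GIB = 1024 ** 3

if __name__ == "__main__":
    for seq_len in (4096, 32768, 131072):
        size_gib = kv_cache_bytes(seq_len) / GIB
        print(f"{seq_len:>7} tokens: {size_gib:.1f} GiB")
```

Under these assumptions each token adds 0.5 MiB to the cache, so a 128K-token context needs about 64 GiB, several times the size of the 16-bit weights of a 7B-parameter model. Because auto-regressive decoding reads this whole cache for each new token, memory traffic rather than arithmetic is likely to dominate at long context lengths.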