Long-context LLMs have enabled numerous downstream applications but also introduced significant challenges in computational and memory efficiency. To address these challenges, optimizations for long-context inference have been developed, centered around the KV cache. However, existing benchmarks often evaluate only single-request scenarios, neglecting the full lifecycle of the KV cache in real-world use. This oversight is particularly critical, as KV cache reuse has been widely adopted in LLM inference frameworks such as vLLM and SGLang, as well as by LLM providers including OpenAI, Microsoft, Google, and Anthropic. To address this gap, we introduce SCBench (SharedContextBench), a comprehensive benchmark for evaluating long-context methods from a KV cache-centric perspective: 1) KV cache generation, 2) KV cache compression, 3) KV cache retrieval, and 4) KV cache loading. Specifically, SCBench uses test examples with shared context, spanning 12 tasks with two shared-context modes and covering four categories of long-context capabilities: string retrieval, semantic retrieval, global information, and multi-task. With it, we provide an extensive KV cache-centric analysis of eight categories of long-context solutions, including Gated Linear RNNs, Mamba-Attention hybrids, and efficient methods such as sparse attention, KV cache dropping, quantization, retrieval, loading, and prompt compression. The evaluation is conducted on eight long-context LLMs. Our findings show that sub-O(n) memory methods suffer in multi-turn scenarios, while sparse encoding with O(n) memory and sub-O(n^2) pre-filling computation performs robustly. Dynamic sparsity yields more expressive KV caches than static patterns, and layer-level sparsity in hybrid architectures reduces memory usage while maintaining strong performance. Additionally, we identify attention distribution shift issues in long-generation scenarios. Our code is available at https://aka.ms/SCBench.