Large Language Models (LLMs) are widely used across various domains, processing millions of daily requests. This surge in demand poses significant challenges in optimizing throughput and latency while keeping costs manageable. The Key-Value (KV) cache, the standard mechanism for retaining previous computations, makes LLM inference heavily memory-bound. While batching strategies can improve performance, they frequently cause severe memory fragmentation. Even though state-of-the-art systems such as vLLM mitigate KV cache fragmentation with PagedAttention, they still suffer from inefficient memory and computational operations because page management is tightly coupled with the computation kernels. This study introduces vTensor, a novel tensor structure for LLM inference built on GPU virtual memory management (VMM). vTensor addresses these limitations by decoupling computation from memory defragmentation and offering dynamic extensibility. Our framework employs a CPU-GPU heterogeneous approach, ensuring efficient, fragmentation-free memory management while accommodating diverse computation kernels across different LLM architectures. Experimental results show that vTensor achieves an average speedup of 1.86x across different models, with up to 2.42x in multi-turn chat scenarios. In kernel-level evaluation, vTensor delivers average speedups of 2.12x and 3.15x, reaching up to 3.92x and 3.27x, over the SGLang Triton prefix-prefilling kernel and the vLLM PagedAttention kernel, respectively. Furthermore, it frees approximately 71.25% (57 GB) of memory on an NVIDIA A100 GPU compared to vLLM, enabling more memory-intensive workloads.
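The core idea of decoupling a contiguous logical view from scattered physical pages can be sketched in miniature. The following toy Python class is an illustrative analogy only, not the paper's implementation: it mimics how VMM primitives (e.g., CUDA's `cuMemAddressReserve`/`cuMemMap`) let a tensor grow by mapping new fixed-size pages, while computation indexes one flat address space. All names and the page size here are hypothetical.

```python
# Illustrative sketch (NOT the paper's implementation): a toy vTensor-style
# structure whose contiguous logical index space is backed by independently
# allocated fixed-size pages, so growth needs no copy and causes no
# fragmentation of existing pages.

PAGE_SIZE = 4  # elements per page; hypothetical, real systems use large chunks


class VirtualTensor:
    """Contiguous logical view over physically separate pages."""

    def __init__(self):
        self.pages = []   # each page is a separately "allocated" buffer
        self.length = 0   # logical number of elements

    def append(self, value):
        # Map a fresh page only when the last one is full; existing pages
        # are never moved or copied (analogous to mapping new physical
        # memory into a reserved virtual address range).
        if self.length % PAGE_SIZE == 0:
            self.pages.append([0] * PAGE_SIZE)
        self.pages[-1][self.length % PAGE_SIZE] = value
        self.length += 1

    def __getitem__(self, i):
        # Kernels see one flat index space; the page translation happens
        # here, decoupled from how and when pages were allocated.
        if not 0 <= i < self.length:
            raise IndexError(i)
        return self.pages[i // PAGE_SIZE][i % PAGE_SIZE]


t = VirtualTensor()
for v in range(10):
    t.append(v)
```

With `PAGE_SIZE = 4`, ten appends map three pages, yet reads use plain flat indices, which is the separation between memory management and computation that the abstract describes.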