Large language models (LLMs) are widely used but expensive to run, especially as inference workloads grow. To lower costs, it is crucial to maximize the request batch size by managing GPU memory efficiently. While PagedAttention has recently been proposed to improve the efficiency of memory management, we find that the growing heterogeneity in embedding dimensions, attention mechanisms, and access patterns of modern LLM architectures introduces new challenges for memory allocation. In this paper, we present Jenga, a novel memory allocation framework for heterogeneous embeddings in LLMs. Jenga tackles two key challenges: (1) minimizing memory fragmentation when managing embeddings of different sizes, and (2) enabling flexible caching and eviction policies tailored to the specific token-dependency patterns of various layers. Jenga employs a two-level memory allocator, leveraging the least common multiple (LCM) of embedding sizes to optimize memory usage, and provides APIs to express layer-specific caching logic to enhance memory reuse. We implement Jenga on vLLM, a state-of-the-art LLM inference engine, and evaluate it with diverse LLMs, datasets, and GPU configurations. Evaluations show that Jenga improves GPU memory utilization by up to 79.6% and increases serving throughput by up to 4.92x (1.80x on average).
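To make the LCM idea concrete, the sketch below shows why sizing allocation pages by the least common multiple of the per-token embedding sizes avoids partial-token fragmentation: every layer type then fits a whole number of tokens per page. This is only an illustrative sketch; the byte sizes and the three layer types are hypothetical examples, not values from the paper, and Jenga's actual two-level allocator is more involved.

```python
from functools import reduce
from math import gcd


def lcm(a: int, b: int) -> int:
    """Least common multiple of two positive integers."""
    return a * b // gcd(a, b)


def page_size(per_token_sizes: list[int]) -> int:
    """Choose a page size (in bytes) that every layer's per-token
    embedding size divides evenly, so no page holds a partial token."""
    return reduce(lcm, per_token_sizes)


def tokens_per_page(page: int, token_size: int) -> int:
    """How many whole tokens of a given layer fit in one page."""
    return page // token_size


# Hypothetical per-token embedding sizes (bytes) for three layer kinds,
# e.g. full attention KV, sliding-window attention KV, recurrent state.
sizes = [4096, 1024, 1536]
page = page_size(sizes)  # LCM(4096, 1024, 1536) = 12288
fits = [tokens_per_page(page, s) for s in sizes]  # [3, 12, 8]
print(page, fits)
```

Because the page size is a common multiple, each layer type can carve a page into exact token slots, so heterogeneous embeddings never leave unusable tail bytes inside a page.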