Large-scale deep neural networks (DNNs), such as large language models (LLMs), have revolutionized the artificial intelligence (AI) field and become increasingly popular. However, training or fine-tuning such models requires substantial computational power and resources, and the memory capacity of a single acceleration device such as a GPU is one of the most important bottlenecks. Owing to the prohibitively large overhead (e.g., $10\times$) of the GPU's native memory allocator, DNN frameworks like PyTorch and TensorFlow adopt a caching allocator that maintains a memory pool with a splitting mechanism for fast memory (de)allocation. Unfortunately, the caching allocator's efficiency degrades quickly under popular memory-reduction techniques such as recomputation, offloading, distributed training, and low-rank adaptation. The primary reason is that these techniques introduce frequent and irregular memory (de)allocation requests, leading to severe fragmentation for the splitting-based caching allocator. To mitigate this fragmentation problem, we propose a novel memory allocation framework based on low-level GPU virtual memory management, called GPU memory lake (GMLake). GMLake employs a novel virtual memory stitching (VMS) mechanism, which can fuse or combine non-contiguous memory blocks through virtual memory address mapping. GMLake reduces GPU memory usage by an average of 9.2 GB (up to 25 GB) and fragmentation by 15% (up to 33%) across eight LLM models on an A100 GPU with 80 GB of memory. GMLake is completely transparent to the DNN models and memory-reduction techniques and ensures the seamless execution of resource-intensive deep-learning tasks. We have open-sourced GMLake at https://github.com/intelligent-machine-learning/glake/tree/main/GMLake.
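The core idea behind VMS can be illustrated with a minimal conceptual simulation (this is not GMLake's actual implementation, which uses low-level GPU virtual memory management APIs; the function names and sizes below are illustrative only). A splitting-based caching allocator must find one contiguous free block large enough for a request, so it can fail even when total free memory is sufficient; a stitching allocator can map several non-contiguous physical blocks into one contiguous virtual address range:

```python
# Conceptual sketch (illustrative, not GMLake's real code): compare
# allocation success for a splitting-based caching allocator versus
# a VMS-style stitching allocator on a fragmented free list.

def can_alloc_splitting(free_blocks, request):
    """A splitting caching allocator needs a single contiguous
    free block at least as large as the request."""
    return any(size >= request for size in free_blocks)

def can_alloc_stitching(free_blocks, request):
    """A stitching allocator (GMLake-style VMS) can fuse
    non-contiguous physical blocks into one contiguous virtual
    range, so only the total free capacity matters."""
    return sum(free_blocks) >= request

# Hypothetical scenario: 6 GB free in total, fragmented into 2 GB chunks.
free = [2, 2, 2]
print(can_alloc_splitting(free, 5))  # False: no single 5 GB block exists
print(can_alloc_stitching(free, 5))  # True: 2 + 2 + 2 GB can be stitched
```

In the real system, the stitching is performed with virtual-to-physical address mappings rather than data copies, which is what keeps GMLake transparent to the model and to memory-reduction techniques layered above it.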