The rapid evolution of Large Language Models (LLMs) towards long-context reasoning and sparse architectures has pushed memory requirements far beyond the capacity of an individual device's high-bandwidth memory (HBM). While emerging SuperNode architectures offer terabyte-scale shared memory pools over high-bandwidth interconnects, existing software stacks fail to exploit this hardware effectively. Current runtime-based offloading and swapping techniques operate with only a local view, leading to reactive scheduling and exposed communication latency that stalls the computation pipeline. In this paper, we propose \textbf{HyperOffload}, a SuperNode memory management framework. It employs a compiler-assisted, graph-driven approach to memory management that treats remote memory accesses as explicit operations in the computation graph, designed specifically for hierarchical SuperNode architectures. Unlike reactive runtime systems, HyperOffload represents data movement as cache operators within the compiler's Intermediate Representation (IR). This design enables global, compile-time analysis of tensor lifetimes and execution dependencies. Leveraging this visibility, we develop a global execution-order refinement algorithm that statically schedules data transfers to hide remote memory latency behind compute-intensive regions. We implement HyperOffload within the production deep learning framework MindSpore, adding a remote memory backend and specialized compiler passes. Evaluation on representative LLM workloads shows that HyperOffload reduces peak device memory usage by up to 26\% for inference while maintaining end-to-end performance. Our work demonstrates that integrating memory-augmented hardware into the compiler's optimization framework is essential for scaling next-generation AI workloads.