The rapid evolution of Large Language Models (LLMs) toward long-context reasoning and sparse architectures has pushed memory requirements far beyond the capacity of a single device's HBM. While emerging supernode architectures offer terabyte-scale shared memory pools via high-bandwidth interconnects, existing software stacks fail to exploit this hardware effectively. Current runtime-based offloading and swapping techniques operate with only a local view, leading to reactive scheduling and exposed communication latency that stall the computation pipeline. In this paper, we propose the SuperNode Memory Management Framework (\textbf{HyperOffload}), a compiler-assisted approach that leverages graph-driven memory management to treat remote memory access as explicit operations in the computation graph, designed specifically for hierarchical supernode architectures. Unlike reactive runtime systems, HyperOffload represents data movement using cache operators within the compiler's Intermediate Representation (IR). This design enables global, compile-time analysis of tensor lifetimes and execution dependencies. Leveraging this visibility, we develop a global execution-order refinement algorithm that statically schedules data transfers to hide remote memory latency behind compute-intensive regions. We implement HyperOffload within the production deep learning framework MindSpore, adding a remote memory backend and specialized compiler passes. Evaluation on representative LLM workloads shows that HyperOffload reduces peak device memory usage by up to 26\% for inference while maintaining end-to-end performance. Our work demonstrates that integrating memory-augmented hardware into the compiler's optimization pipeline is essential for scaling next-generation AI workloads.