The rapid shift toward agentic and long-context workloads in large language models (LLMs) is pushing the industry beyond the capacity of individual servers and toward disaggregated shared storage that can hold TB-scale context state. This shift has given rise to specialized shared context layers that externalize cumulative inference state and share it across distributed clusters. Offloading NVMe-over-Fabrics (NVMe-oF) target processing to a data processing unit (DPU) in just-a-bunch-of-flash (JBOF) architectures accelerates that processing, but such designs still require sophisticated software-level optimization and carry a significant cost burden. Consequently, the ideal architecture for scaling shared context infrastructure remains an active area of exploration. In this paper, we propose ITME (Inference Tiered Memory Expansion), which leverages CXL-hybrid memory to present a TB-scale, byte-addressable remote memory expansion. This approach enables cost-efficient scaling and simplifies the software stack through direct byte-addressability, addressing both challenges of shared context infrastructure. Our key insight is that model weights and prefix caches, which dominate the memory footprint, have deterministic access patterns, so the system can proactively manage data movement across the memory-storage hierarchy. We evaluate ITME's performance potential with production-grade SK Hynix CMM and PCIe Gen5 NVMe SSDs, and demonstrate its functional feasibility with an FPGA-based hardware prototype. Overall, ITME extends conventional CPU offloading with additional remote memory expansion that accommodates KV cache footprints exceeding host memory limits, improving throughput by up to 35.7%.
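The key insight above, that a deterministic access order lets the system move data ahead of demand, can be illustrated with a toy model. The sketch below is not ITME's mechanism: `TieredStore`, `run`, the LRU eviction policy, and the `lookahead` parameter are all hypothetical names and choices made for illustration, and prefetches are treated as free, ignoring bandwidth and latency. It only shows why a known access schedule can turn demand misses in a small fast tier into hits.

```python
from collections import OrderedDict


class TieredStore:
    """Toy two-tier store: a small fast tier (LRU) in front of a large slow tier.

    Hypothetical illustration only; not the ITME design.
    """

    def __init__(self, fast_capacity, slow_tier):
        self.fast = OrderedDict()          # block_id -> data, oldest first
        self.fast_capacity = fast_capacity
        self.slow = slow_tier              # dict standing in for the large tier
        self.hits = 0
        self.misses = 0

    def _promote(self, block_id):
        """Bring a block into the fast tier, evicting the LRU block if full."""
        if block_id in self.fast:
            self.fast.move_to_end(block_id)
            return
        if len(self.fast) >= self.fast_capacity:
            self.fast.popitem(last=False)  # evict least recently used
        self.fast[block_id] = self.slow[block_id]

    def prefetch(self, block_ids):
        """Proactively stage blocks that the known schedule will touch soon."""
        for block_id in block_ids:
            self._promote(block_id)

    def read(self, block_id):
        """Demand read: count a hit if the block is already in the fast tier."""
        if block_id in self.fast:
            self.hits += 1
            self.fast.move_to_end(block_id)
        else:
            self.misses += 1
            self._promote(block_id)
        return self.fast[block_id]


def run(schedule, fast_capacity, lookahead):
    """Replay a known access schedule; return (hits, misses) of demand reads."""
    slow = {b: f"block-{b}" for b in set(schedule)}
    store = TieredStore(fast_capacity, slow)
    for i, block_id in enumerate(schedule):
        store.read(block_id)
        # The schedule is deterministic, so the next blocks are known exactly.
        store.prefetch(schedule[i + 1 : i + 1 + lookahead])
    return store.hits, store.misses


# Eight blocks read in the same order on every pass (e.g. layer by layer),
# with a fast tier that holds only four of them.
schedule = list(range(8)) * 2
print(run(schedule, fast_capacity=4, lookahead=0))  # reactive LRU only
print(run(schedule, fast_capacity=4, lookahead=2))  # schedule-driven prefetch
```

In this toy setting, purely reactive LRU misses on every read because the cyclic working set is larger than the fast tier, while a two-block lookahead leaves only the initial cold miss. A real system would also have to budget for transfer time between tiers, which this sketch deliberately leaves out.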