Engram conditional memory has emerged as a promising component for LLMs by decoupling static knowledge lookup from dynamic computation. Because Engram exhibits sparse access patterns and supports prefetching, its massive embedding tables are well suited to offloading to lower-tier memory. In this paper, we propose using a Compute Express Link (CXL) memory pool for Engram storage. Compared to RDMA, CXL provides the fine-grained, low-latency access required by Engram's small, discrete retrieval patterns. We integrate the CXL-based Engram pool into SGLang and achieve near-DRAM end-to-end performance, providing a scalable and cost-efficient storage solution for future Engram-integrated LLMs without compromising inference performance.