A Comprehensive Simulation Framework for CXL Disaggregated Memory

Compute Express Link (CXL) is a pivotal technology for memory disaggregation in future heterogeneous computing systems, enabling on-demand memory expansion and improved resource utilization. Despite its potential, CXL is still in its early stages, with few commercial products available, which highlights the need for a reliable system-level simulation tool. This paper introduces CXL-DMSim, an open-source, high-fidelity full-system simulator for CXL disaggregated memory systems, with simulation speed comparable to gem5. CXL-DMSim includes a flexible CXL memory expander model, a device driver, and support for the CXL.io and CXL.mem protocols. It supports both app-managed and kernel-managed modes, with the latter featuring a NUMA-compatible mechanism. Rigorous verification against real hardware testbeds with FPGA-based and ASIC-based CXL memory prototypes confirms CXL-DMSim's accuracy, with an average simulation error of 4.1%. Benchmark results using LMbench and STREAM indicate that CXL-FPGA memory has approximately 2.88x higher latency than local DDR, while CXL-ASIC latency is about 2.18x higher. CXL-FPGA achieves 45-69% of local DDR's memory bandwidth, and CXL-ASIC reaches 82-83%. The performance of CXL memory is significantly more sensitive to read/write patterns than that of local DDR, with optimal bandwidth at a 74%:26% read/write ratio rather than 50%:50%, owing to the current CXL+DDR controller design. The study also shows that CXL memory can markedly enhance the performance of memory-intensive applications, with the largest improvement seen in Viper (about 23x) and a 16% gain in bandwidth-sensitive scenarios such as MERCI. CXL-DMSim's observability and expandability are demonstrated through detailed case studies, showcasing its potential for research on future CXL-interconnected hybrid memory pools.
