Large-scale AI training and inference require hundreds of gigabytes to terabytes of DRAM, and their high peak-to-average utilization ratios lead to overprovisioning. In cloud computing, DRAM constitutes a significant share of the cost, yet recent articles show that it is heavily underutilized. Memory disaggregation addresses both of these problems. With the advent of the CXL protocol, there is renewed interest in designing and optimizing computing systems with disaggregated memory. At present, however, few simulation tools are available for exploring the design space and evaluating the performance tradeoffs of such systems. In this paper, we propose CXL-ClusterSim, a full-system modeling and simulation framework that combines the gem5 simulator, for fidelity, with the Structural Simulation Toolkit (SST), for parallel simulation. We outline the challenges in creating this simulation infrastructure and present a design that is scalable, flexible, and reasonably fast. The framework is intended to help computer architects explore the design space of CXL-based disaggregated memory and identify new opportunities for hardware/software codesign and performance optimization.