The ever-growing demand for memory with larger capacity and higher bandwidth has driven recent innovations in memory expansion and disaggregation technologies based on Compute Express Link (CXL). In particular, CXL-based memory expansion has recently gained notable attention for two reasons. It can expand memory capacity and bandwidth economically, and it decouples memory technologies from the CPU's specific memory interface. However, CXL memory devices have not been widely available, so they have been emulated using DDR memory in a remote NUMA node. In this paper, for the first time, we comprehensively evaluate a true CXL-ready system based on the latest 4th-generation Intel Xeon CPU with three CXL memory devices from different manufacturers. Specifically, we run a set of microbenchmarks for two purposes: to compare the performance of true CXL memory with that of emulated CXL memory, and to analyze the complex interplay between the CPU and CXL memory in depth. This reveals important differences between emulated and true CXL memory, some of which will compel researchers to revisit the analyses and proposals of recent work. Next, we identify opportunities for memory-bandwidth-intensive applications to benefit from CXL memory. Lastly, we propose Caption, a CXL-memory-aware dynamic page allocation policy, to use CXL memory more efficiently as a bandwidth expander. We demonstrate that Caption automatically converges to an empirically favorable percentage of pages allocated to CXL memory. This improves the performance of memory-bandwidth-intensive applications by up to 24% compared to the default page allocation policy designed for traditional NUMA systems.