High-Performance Computing (HPC) and Artificial Intelligence (AI) workloads typically demand substantial memory bandwidth and, to a lesser degree, memory capacity. CXL memory expansion modules, also known as CXL Type 3 devices, increase both the memory capacity and the memory bandwidth of server systems via the CXL protocol, which runs over the processor's PCIe interfaces. This paper presents experimental findings on achieving increased memory bandwidth for HPC and AI workloads using Micron's CXL modules. It is the first study to report measured results with eight Micron CZ122 CXL E3.S (x8) devices on the 128-core Intel Xeon 6 processor 6900P (previously codenamed Granite Rapids AP), alongside Micron DDR5 memory operating at 6400 MT/s on each of the CPU's 12 DRAM channels. The eight CXL memories were configured as a single unified NUMA node, and a software-based page-level interleaving mechanism, available in Linux kernel v6.9+, was used to interleave allocations between the DDR5 and CXL memory nodes and thereby improve overall system bandwidth. Memory expansion via CXL boosts read-only bandwidth by 24% and mixed read/write bandwidth by up to 39%. Across HPC and AI workloads, the geometric mean of the performance speedups is 24%.
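The page-level interleaving setup described above can be sketched with the weighted-interleave mempolicy that Linux kernel v6.9+ exposes through sysfs, launched via numactl (v2.0.19 or later supports `--weighted-interleave`). This is a minimal illustration, not the paper's exact configuration: the node numbering (node 0 for local DDR5, node 1 for the unified CXL node), the 3:1 weights, and the `./hpc_workload` binary are all assumptions for the example; in practice the weights would be chosen to match the measured bandwidth ratio of the two tiers.

```shell
# Assumed layout: node 0 = local DDR5, node 1 = unified CXL NUMA node.
# Weights are illustrative; they should reflect each tier's relative bandwidth.

# Kernel v6.9+ exposes per-node weights for the weighted-interleave policy:
echo 3 | sudo tee /sys/kernel/mm/mempolicy/weighted_interleave/node0  # DDR5 share
echo 1 | sudo tee /sys/kernel/mm/mempolicy/weighted_interleave/node1  # CXL share

# numactl v2.0.19+ can then run a workload with pages interleaved across
# both nodes according to those weights ('./hpc_workload' is a placeholder):
numactl --weighted-interleave=0,1 ./hpc_workload
```

With this policy, page allocations are distributed across the listed nodes in proportion to the configured weights, so the aggregate bandwidth of DDR5 and CXL memory is exercised together rather than filling one tier before the other.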