Hardware-based memory pooling enabled by interconnect standards like CXL has been gaining popularity among cloud providers and system integrators. While pooling memory resources has cost benefits, it comes at the penalty of increased memory access latency. With yet another addition to the memory hierarchy, local DRAM can potentially be used as a block cache (DRAM cache) for fabric-attached memory (FAM), and data prefetching techniques can be used to hide the FAM access latency. This paper proposes a system for prefetching sub-page blocks from FAM into the DRAM cache to improve data access latency and application performance. We further optimize our DRAM cache prefetch mechanism through enhancements that mitigate the performance degradation caused by bandwidth contention at the FAM. We consider the potential for providing additional functionality at the CXL memory node through weighted fair queuing of demand and prefetch requests, and we compare such a memory-node-level approach to adapting the prefetch rate at the compute node based on observed latencies. We evaluate the proposed system in single-node and multi-node configurations with applications from the SPEC, PARSEC, Splash, and GAP benchmark suites. Our evaluation suggests that DRAM cache prefetching results in a 7% IPC improvement, and that both of the proposed optimizations can further improve IPC by 7-10%.
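To make the memory-node mechanism concrete, the sketch below shows a simplified, self-clocked approximation of weighted fair queuing between a demand queue and a prefetch queue, of the kind the abstract describes at the CXL memory node. The two-queue split, the 4:1 weight ratio, and the 256 B sub-page block size are illustrative assumptions, not the paper's actual parameters or implementation.

```cpp
// Sketch: self-clocked approximation of weighted fair queuing (WFQ)
// between demand and prefetch requests at a FAM memory node.
// Weights, block size, and the two-queue split are assumptions.
#include <algorithm>
#include <cstdint>
#include <deque>
#include <iostream>

struct Request {
    uint64_t addr;
    uint32_t bytes;   // service demand, e.g. a 256 B sub-page block
};

class WfqScheduler {
public:
    WfqScheduler(double demandWeight, double prefetchWeight)
        : weights_{demandWeight, prefetchWeight} {}

    // qid 0 = demand, qid 1 = prefetch
    void enqueue(int qid, Request r) { queues_[qid].push_back(r); }

    // Serve the backlogged queue whose head request finishes earliest
    // in virtual time: finish = start + bytes / weight.
    bool dequeue(Request& out, int& qid) {
        int best = -1;
        double bestFinish = 0.0;
        for (int q = 0; q < 2; ++q) {
            if (queues_[q].empty()) continue;
            double start = std::max(vtime_, lastFinish_[q]);
            double finish = start + queues_[q].front().bytes / weights_[q];
            if (best < 0 || finish < bestFinish) {
                best = q;
                bestFinish = finish;
            }
        }
        if (best < 0) return false;   // both queues empty
        out = queues_[best].front();
        queues_[best].pop_front();
        lastFinish_[best] = bestFinish;
        vtime_ = bestFinish;          // self-clocked virtual time advance
        qid = best;
        return true;
    }

private:
    std::deque<Request> queues_[2];
    double weights_[2];
    double lastFinish_[2] = {0.0, 0.0};
    double vtime_ = 0.0;
};

int main() {
    // Hypothetical 4:1 preference for demand traffic over prefetches.
    WfqScheduler sched(/*demandWeight=*/4.0, /*prefetchWeight=*/1.0);
    sched.enqueue(0, {0x1000, 256});
    sched.enqueue(1, {0x2000, 256});
    sched.enqueue(1, {0x3000, 256});
    Request r;
    int q;
    while (sched.dequeue(r, q))
        std::cout << (q == 0 ? "demand  0x" : "prefetch 0x")
                  << std::hex << r.addr << std::dec << "\n";
}
```

Weighting demand requests above prefetches caps the share of FAM bandwidth that prefetch traffic can claim under contention, while still letting prefetches drain whenever the demand queue is idle; the compute-node alternative the abstract compares against would instead throttle the prefetch issue rate when observed FAM latencies rise.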