We propose overcoming the memory capacity limitation of GPUs with high-capacity Storage-Class Memory (SCM) and a DRAM cache. By significantly increasing memory capacity with SCM, the GPU can capture a larger fraction of the memory footprint than HBM for workloads that oversubscribe memory, achieving high speedups. However, the DRAM cache must be carefully designed to address the latency and bandwidth (BW) limitations of the SCM while minimizing cost overhead and accounting for the GPU's characteristics. Because the massive number of GPU threads can thrash the DRAM cache, we first propose an SCM-aware DRAM cache bypass policy for GPUs that considers the multi-dimensional characteristics of GPU memory accesses with SCM and bypasses DRAM for data with low performance utility. In addition, to reduce DRAM cache probes and increase effective DRAM BW at minimal cost, we propose a Configurable Tag Cache (CTC) that repurposes part of the L2 cache to cache DRAM cacheline tags; the L2 capacity used for the CTC can be adjusted by users for adaptability. Furthermore, to minimize DRAM cache probe traffic from CTC misses, our Aggregated Metadata-In-Last-column (AMIL) DRAM cache organization co-locates all DRAM cacheline tags in a single column within a row. AMIL also retains full ECC protection, unlike the Tag-And-Data (TAD) organization of prior DRAM caches. Additionally, we propose SCM throttling to curtail power and exploit SCM's SLC/MLC modes to adapt to the workload's memory footprint. While our techniques apply to different DRAM and SCM devices, we focus on a Heterogeneous Memory Stack (HMS) organization that stacks SCM dies on top of DRAM dies for high performance. Compared to HBM, HMS improves performance by up to 12.5x (2.9x overall) and reduces energy by up to 89.3% (48.1% overall). Compared to prior works, we reduce DRAM cache probe and SCM write traffic by 91-93% and 57-75%, respectively.
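To make the AMIL organization concrete, the C sketch below shows one plausible lookup path under assumed parameters (2KB rows, 64B columns, a direct-mapped cache); the row/column sizes, metadata field widths, and helper names (amil_map, amil_probe) are illustrative assumptions, not the paper's exact design. It illustrates the key property: a single column read per activated row returns the tags for every cacheline in that row, so a hit needs no per-line tag probe.

```c
/* Minimal sketch of an AMIL-style lookup path. All sizes and field
 * layouts are illustrative assumptions, not the paper's parameters. */
#include <stdint.h>
#include <stdbool.h>

#define ROW_BYTES     2048                      /* assumed DRAM row size     */
#define COL_BYTES     64                        /* assumed column/line size  */
#define COLS_PER_ROW  (ROW_BYTES / COL_BYTES)   /* 32 columns per row        */
#define DATA_COLS     (COLS_PER_ROW - 1)        /* last column holds tags    */

/* Per-cacheline metadata kept in the aggregated last column. 31 such
 * entries would be bit-packed into one 64B column in a real layout;
 * plain structs are used here for readability. */
typedef struct {
    uint32_t tag;    /* upper address bits of the cached SCM line */
    uint8_t  valid;
    uint8_t  dirty;
} amil_tag_t;

/* One row's worth of tags, filled by a single column read. */
typedef struct {
    amil_tag_t entry[DATA_COLS];
} amil_tag_col_t;

/* Map an SCM line address to its direct-mapped slot in the DRAM cache. */
static void amil_map(uint64_t line_addr, uint64_t num_rows,
                     uint64_t *row, unsigned *col, uint32_t *tag)
{
    uint64_t set = line_addr % (num_rows * DATA_COLS);
    *row = set / DATA_COLS;
    *col = (unsigned)(set % DATA_COLS);
    *tag = (uint32_t)(line_addr / (num_rows * DATA_COLS));
}

/* Probe against the tag column fetched for the activated row: one
 * column access resolves hit/miss for any line in that row. */
static bool amil_probe(const amil_tag_col_t *tags, unsigned col, uint32_t tag)
{
    return tags->entry[col].valid && tags->entry[col].tag == tag;
}
```

By contrast, a TAD layout stores each tag alongside its data (typically by repurposing ECC bits), so every line checked costs its own access and full ECC protection is sacrificed; keeping tags in a dedicated column is what lets AMIL amortize probes and retain ECC.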