Emerging applications such as big data analytics and machine learning require increasingly large amounts of main memory, often exceeding the capacity that current commodity processors built on DRAM technology can provide. To address this, recent research has focused on off-chip memory controllers that give access to diverse memory media, each with its own density and latency characteristics. While these solutions improve memory system performance, they also lengthen the already significant memory latency. As a result, multi-level prefetching techniques are essential to hide these extended latencies. This paper investigates the benefits of prefetching on both sides of the memory system: the off-chip memory and the on-chip cache hierarchy. Our primary objective is to assess the impact of a multi-level prefetching engine on overall system performance; we also analyze the individual contribution of each prefetching level. To this end, the study evaluates two prefetching approaches, HMC (Hybrid Memory Controller) and HMC+L1, both of which employ prefetching mechanisms commonly used by processor vendors. The HMC approach integrates a prefetcher within the off-chip hybrid memory controller, while the HMC+L1 approach combines it with additional on-chip L1 prefetchers. Experimental results on an out-of-order processor show that on-chip cache prefetchers are crucial for maximizing the benefits of off-chip prefetching, which in turn further enhances performance. Specifically, the off-chip HMC prefetcher achieves coverage and accuracy above 60% and up to 80%, while the combined HMC+L1 approach raises off-chip prefetcher coverage to as much as 92%. Consequently, the overall performance improvement rises from 9% with the HMC approach to 12% when L1 prefetching is also employed.
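The coverage, accuracy, and performance figures quoted above can be made concrete with a minimal sketch. It assumes the conventional definitions (coverage as the fraction of would-be misses removed by prefetching, accuracy as the fraction of issued prefetches that are later used); the abstract does not state the exact formulas, so these definitions, the `PrefetchStats` type, and the counter values are illustrative assumptions rather than the paper's method or data.

```python
from dataclasses import dataclass


@dataclass
class PrefetchStats:
    """Hypothetical counters collected for one prefetcher during a run."""
    issued: int          # prefetch requests sent by the prefetcher
    useful: int          # prefetched blocks later hit by a demand access
    demand_misses: int   # demand misses that remain with prefetching enabled


def coverage(s: PrefetchStats) -> float:
    """Fraction of would-be misses that prefetching eliminated (assumed definition)."""
    would_be_misses = s.useful + s.demand_misses
    return s.useful / would_be_misses if would_be_misses else 0.0


def accuracy(s: PrefetchStats) -> float:
    """Fraction of issued prefetches that turned out to be useful (assumed definition)."""
    return s.useful / s.issued if s.issued else 0.0


def speedup_percent(baseline_cycles: int, prefetch_cycles: int) -> float:
    """Performance improvement over a no-prefetch baseline, in percent."""
    return (baseline_cycles / prefetch_cycles - 1.0) * 100.0


# Made-up counters, chosen only to exercise the functions.
stats = PrefetchStats(issued=1000, useful=800, demand_misses=200)
print(f"coverage={coverage(stats):.2f} accuracy={accuracy(stats):.2f}")
print(f"speedup={speedup_percent(112, 100):.1f}%")
```

Under these definitions, raising coverage means fewer demand misses reach the slow memory media, while accuracy tracks how much prefetch traffic is wasted. The two can move independently, which is why the abstract reports both.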