Prefetching and off-chip prediction are two techniques proposed to hide long memory access latencies in high-performance processors. In this work, we demonstrate that: (1) prefetching and off-chip prediction often provide complementary performance benefits, yet (2) naively combining them often fails to realize their full performance potential, and (3) existing prefetcher control policies leave significant room for performance improvement. Our goal is to design a holistic framework that can autonomously learn to coordinate an off-chip predictor with multiple prefetchers employed at various cache levels. To this end, we propose a new technique called Athena, which models the coordination between prefetchers and an off-chip predictor (OCP) as a reinforcement learning (RL) problem. Athena acts as the RL agent: it observes multiple system-level features (e.g., prefetcher/OCP accuracy, bandwidth usage) over an epoch of program execution and uses them as state information to select a coordination action (i.e., enabling the prefetcher and/or OCP, and adjusting prefetcher aggressiveness). At the end of every epoch, Athena receives a numerical reward that measures the change in multiple system-level metrics (e.g., number of cycles taken to execute an epoch). Athena uses this reward to autonomously and continuously learn a policy to coordinate prefetchers with the OCP. Our extensive evaluation using a diverse set of memory-intensive workloads shows that Athena consistently outperforms prior state-of-the-art coordination policies across a wide range of system configurations with various combinations of underlying prefetchers, OCPs, and main memory bandwidths, while incurring only modest storage overhead. Athena is freely available at https://github.com/CMU-SAFARI/Athena.
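To make the epoch-based loop described above concrete, the following is a minimal illustrative sketch, not Athena's actual implementation. It assumes a tabular Q-learning agent with epsilon-greedy exploration; the abstract does not state which RL algorithm, state encoding, or reward function Athena uses. The action list, the four-bucket feature quantization, the cycles-only reward, and all names (`EpochCoordinator`, `quantize`, `cycle_reward`) are hypothetical and chosen only for illustration.

```python
import random
from collections import defaultdict

# Hypothetical action space: (prefetcher_enabled, ocp_enabled, prefetch_degree).
# The real Athena action set is not given in the abstract.
ACTIONS = [
    (False, False, 0),
    (False, True, 0),
    (True, False, 1),
    (True, False, 4),
    (True, True, 1),
    (True, True, 4),
]


def quantize(value, buckets=4):
    """Map a fraction in [0, 1] to a bucket index in [0, buckets - 1]."""
    clamped = min(max(value, 0.0), 1.0)
    return min(int(clamped * buckets), buckets - 1)


def cycle_reward(prev_cycles, cycles):
    """Assumed reward: relative cycle reduction versus the previous epoch."""
    return (prev_cycles - cycles) / prev_cycles


class EpochCoordinator:
    """Tabular Q-learning agent that picks one coordination action per epoch."""

    def __init__(self, alpha=0.1, gamma=0.9, epsilon=0.05, seed=0):
        self.alpha = alpha      # learning rate
        self.gamma = gamma      # discount factor
        self.epsilon = epsilon  # exploration probability
        self.rng = random.Random(seed)
        # Q-table: state tuple -> one Q-value per action, initialized to 0.
        self.q = defaultdict(lambda: [0.0] * len(ACTIONS))
        self.prev = None        # (state, action index) used in the last epoch

    def make_state(self, pf_accuracy, ocp_accuracy, bw_usage):
        """Build a discrete state from system-level features in [0, 1]."""
        return (quantize(pf_accuracy), quantize(ocp_accuracy), quantize(bw_usage))

    def select_action(self, state):
        """Epsilon-greedy selection over the Q-values of this state."""
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(len(ACTIONS))
        values = self.q[state]
        return values.index(max(values))

    def end_of_epoch(self, state, reward):
        """Credit the previous epoch's action, then choose the next action."""
        if self.prev is not None:
            prev_state, prev_action = self.prev
            target = reward + self.gamma * max(self.q[state])
            old = self.q[prev_state][prev_action]
            self.q[prev_state][prev_action] = old + self.alpha * (target - old)
        action = self.select_action(state)
        self.prev = (state, action)
        return ACTIONS[action]
```

In a simulator, `end_of_epoch` would be called once per epoch with the features measured during that epoch and a reward such as `cycle_reward(prev_cycles, cycles)`. The returned tuple would then configure the prefetcher and OCP for the next epoch.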