Hermes: Accelerating Long-Latency Load Requests via Perceptron-Based Off-Chip Load Prediction

Long-latency load requests continue to limit the performance of high-performance processors. To increase the latency tolerance of a processor, architects have primarily relied on two key techniques: sophisticated data prefetchers and large on-chip caches. In this work, we show that: 1) even a sophisticated state-of-the-art prefetcher can only predict half of the off-chip load requests on average across a wide range of workloads, and 2) due to the increasing size and complexity of on-chip caches, a large fraction of the latency of an off-chip load request is spent accessing the on-chip cache hierarchy. The goal of this work is to accelerate off-chip load requests by removing the on-chip cache access latency from their critical path. To this end, we propose a new technique called Hermes, whose key idea is to: 1) accurately predict which load requests might go off-chip, and 2) speculatively fetch the data required by the predicted off-chip loads directly from the main memory, while also concurrently accessing the cache hierarchy for such loads. To enable Hermes, we develop a new lightweight, perceptron-based off-chip load prediction technique that learns to identify off-chip load requests using multiple program features (e.g., sequence of program counters). For every load request, the predictor observes a set of program features to predict whether or not the load would go off-chip. If the load is predicted to go off-chip, Hermes issues a speculative request directly to the memory controller once the load's physical address is generated. If the prediction is correct, the load eventually misses the cache hierarchy and waits for the ongoing speculative request to finish, thus hiding the on-chip cache hierarchy access latency from the critical path of the off-chip load. Our evaluation shows that Hermes significantly improves performance of a state-of-the-art baseline. We open-source Hermes.
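
As an illustration of the prediction step described above, the following is a minimal sketch of a hashed-perceptron off-chip predictor, not the authors' implementation. The class name `OffChipPredictor`, the two features (load PC and a hash of recent PCs), the table size, the thresholds, and the weight range are all assumptions chosen for readability. The helper `off_chip_load_latency` is a simplified, hypothetical model of the critical path of an off-chip load. It ignores the cost of issuing the speculative request and any contention at the memory controller.

```python
from collections import deque


class OffChipPredictor:
    """Hashed-perceptron sketch: one small weight table per program feature.

    All sizes and thresholds below are illustrative assumptions.
    """

    def __init__(self, table_size=1024, act_threshold=2, train_threshold=8,
                 w_min=-16, w_max=15, history_len=4):
        self.table_size = table_size
        self.act_threshold = act_threshold      # sum >= this => predict off-chip
        self.train_threshold = train_threshold  # keep training while |sum| is small
        self.w_min, self.w_max = w_min, w_max   # saturating weight range
        self.pc_history = deque([0] * history_len, maxlen=history_len)
        self.tables = [[0] * table_size for _ in range(2)]

    def _indices(self, pc):
        # Feature 0: the load PC. Feature 1: load PC combined with recent PCs.
        seq = 0
        for old_pc in self.pc_history:
            seq = ((seq * 31) ^ old_pc) & 0xFFFFFFFF
        return [pc % self.table_size, (seq ^ pc) % self.table_size]

    def predict(self, pc):
        """Return (predicted_off_chip, weight_sum, table_indices) for a load."""
        idx = self._indices(pc)
        total = sum(table[i] for table, i in zip(self.tables, idx))
        self.pc_history.append(pc)
        return total >= self.act_threshold, total, idx

    def train(self, idx, total, went_off_chip):
        """Update weights once the load's true outcome is known."""
        predicted = total >= self.act_threshold
        if predicted != went_off_chip or abs(total) < self.train_threshold:
            step = 1 if went_off_chip else -1
            for table, i in zip(self.tables, idx):
                table[i] = max(self.w_min, min(self.w_max, table[i] + step))


def off_chip_load_latency(cache_lat, mem_lat, predicted):
    """Simplified latency (in cycles) of a load that truly goes off-chip.

    Without a prediction, the memory access starts only after the cache
    hierarchy lookup misses. With a correct prediction, both proceed in
    parallel, so the longer of the two determines the latency.
    """
    if predicted:
        return max(cache_lat, mem_lat)
    return cache_lat + mem_lat


if __name__ == "__main__":
    predictor = OffChipPredictor()
    always_miss_pc, always_hit_pc = 0x400100, 0x400208  # hypothetical load PCs
    for _ in range(50):
        for pc, off_chip in ((always_miss_pc, True), (always_hit_pc, False)):
            _, total, idx = predictor.predict(pc)
            predictor.train(idx, total, off_chip)
    print(predictor.predict(always_miss_pc)[0])
    print(predictor.predict(always_hit_pc)[0])
    # Hypothetical cycle counts: 40-cycle cache lookup, 200-cycle memory access.
    print(off_chip_load_latency(40, 200, predicted=False))
    print(off_chip_load_latency(40, 200, predicted=True))
```

In this sketch the table indices and weight sum returned by `predict` are carried along with the load and passed back to `train` once the outcome is known, which mirrors the delay between prediction and resolution in a real pipeline. Saturating weights and a training threshold are common choices in perceptron-style predictors, used here as assumptions rather than as details taken from the abstract.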
