Workload Behavior Driven Memory Subsystem Design for Hyperscale

Hyperscalers run services across large fleets of servers, serving billions of users worldwide. These services, however, behave differently from commonly available benchmark suites, resulting in server architectures that are not optimized for cloud workloads. With datacenters becoming a primary market for server processors, it has become crucial to understand the behavior of cloud workloads and optimize server processors for them. To address this, we present MemProf, a memory profiler that profiles the three major causes of stalls in cloud workloads: code fetch, memory bandwidth, and memory latency. We use MemProf to characterize cloud workloads, and we propose and evaluate micro-architectural and memory system design improvements that raise their performance. First, MemProf's code analysis shows that cloud workloads execute the same code across CPU cores. Based on this, we propose shared micro-architectural structures: a shared L2 I-TLB and a shared L2 cache. Next, to address memory bandwidth stalls, we examine the workloads' memory bandwidth distribution and find that only a few pages contribute most of the system bandwidth. We use this finding to evaluate a new high-bandwidth, small-capacity memory tier and show that it performs 1.46$\times$ better than the current baseline configuration. Finally, we look into ways to improve memory latency for cloud workloads. Profiling with MemProf reveals that L2 hardware prefetchers, a common solution for reducing memory latency, have very low coverage and consume a significant amount of memory bandwidth. To help improve hardware prefetcher performance, we built a memory tracing tool to collect and validate production memory access traces.
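
The observation that a few pages account for most of the bandwidth can be illustrated with a minimal sketch. This is not MemProf's method. It assumes an address-level access trace is available, assumes 4 KiB pages, and uses a hypothetical helper name, `hot_page_share`, to count the fewest pages needed to cover a given fraction of all accesses.

```python
from collections import Counter

PAGE_SIZE = 4096  # assumption: 4 KiB pages


def hot_page_share(addresses, traffic_fraction=0.8, page_size=PAGE_SIZE):
    """Return (hot_pages, total_pages).

    hot_pages is the smallest number of pages, taken from hottest to
    coldest, whose accesses cover at least traffic_fraction of the trace.
    """
    # Map each address to its page number and count accesses per page.
    counts = Counter(addr // page_size for addr in addresses)
    total = sum(counts.values())
    if total == 0:
        return 0, 0

    covered = 0
    hot_pages = 0
    # most_common() yields pages sorted by descending access count.
    for _, n in counts.most_common():
        covered += n
        hot_pages += 1
        if covered >= traffic_fraction * total:
            break
    return hot_pages, len(counts)


# Synthetic trace (illustrative only): one hot page plus ten cold pages.
trace = [0x1000] * 90 + [0x2000 + i * PAGE_SIZE for i in range(10)]
hot, total = hot_page_share(trace, traffic_fraction=0.8)
print(f"{hot} of {total} pages cover 80% of accesses")
```

In a skewed trace like this one, a small hot set dominates traffic, which is the property that makes a small-capacity, high-bandwidth tier plausible. Access counts stand in for bandwidth here; a real analysis would weight accesses by bytes transferred over time.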
