Workload Behavior Driven Memory Subsystem Design for Hyperscale

Hyperscalers run services across a large fleet of servers, serving billions of users worldwide. These services, however, behave differently from commonly available benchmark suites, so server architectures tuned to those benchmarks are not optimized for cloud workloads. With datacenters becoming a primary market for server processors, understanding cloud workload behavior well enough to optimize processors for it has become crucial. To address this, we present MemProf, a memory profiler that characterizes the three major sources of stalls in cloud workloads: code fetch, memory bandwidth, and memory latency. We use MemProf to understand the behavior of cloud workloads, and we propose and evaluate micro-architectural and memory system design improvements that raise their performance. First, MemProf's code analysis shows that cloud workloads execute the same code across CPU cores. Based on this, we propose shared micro-architectural structures: a shared L2 I-TLB and a shared L2 cache. Next, to address memory bandwidth stalls, we analyze the workloads' memory bandwidth distribution and find that only a few pages contribute most of the system bandwidth. We use this finding to evaluate a new high-bandwidth, small-capacity memory tier and show that it performs 1.46x better than the current baseline configuration. Finally, we look into ways to improve memory latency for cloud workloads. Profiling with MemProf reveals that L2 hardware prefetchers, a common solution for reducing memory latency, have very low coverage and consume a significant amount of memory bandwidth. To help improve hardware prefetcher performance, we built a memory tracing tool to collect and validate production memory access traces.
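
The observation that a few pages account for most of the bandwidth can be illustrated with a minimal sketch. It is not MemProf's implementation: the function name `hot_page_fraction`, the `target_share` parameter, and the synthetic trace are all assumptions made for illustration, and access counts stand in for bandwidth.

```python
from collections import Counter


def hot_page_fraction(page_accesses, target_share=0.8):
    """Fraction of touched pages needed to cover `target_share` of all accesses.

    `page_accesses` is an iterable of page identifiers, one per sampled access
    (a hypothetical input format, not the paper's trace format).
    """
    counts = Counter(page_accesses)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    covered = 0
    # Walk pages from hottest to coldest until the target share is covered.
    for rank, (_, n) in enumerate(counts.most_common(), start=1):
        covered += n
        if covered >= target_share * total:
            return rank / len(counts)
    return 1.0


# Synthetic, made-up trace: two hot pages and ten cold ones.
trace = [0] * 70 + [1] * 20 + list(range(2, 12))
print(hot_page_fraction(trace))
```

A small returned fraction means the access distribution is heavily skewed, which is the property that would make a small-capacity, high-bandwidth tier attractive.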
