Non-Uniform Memory Access (NUMA) architecture imposes numerous performance challenges on today's cloud workloads. Due to the complexity and massive scale of modern warehouse-scale computers (WSCs), considerable effort is required to improve memory access locality on NUMA architectures. At Baidu, we have found that NUMA optimization yields significant performance benefits for major workloads such as Search and Feed (Baidu's recommendation system). However, conducting NUMA optimization across a large-scale cluster introduces many subtle complexities and workload-specific scenario optimizations. In this paper, we present MAO (Memory Access Optimizer), a solution deployed in Baidu's production environment that improves memory access locality for Baidu's various workloads. MAO includes an online module and an offline module. The online module is responsible for online monitoring, dynamic NUMA node binding, and runtime optimization, while the offline workload characterization module performs data analysis and resource-sensitivity model training. We also propose a new performance model, the "NUMA Sensitivity model", which quantifies the impact of remote memory access on workload performance and projects the potential performance improvement achievable through NUMA optimization for a specific workload. Based on data continuously collected from online monitoring, this model has proven to work effectively in MAO. To date, we have deployed MAO on more than one hundred thousand servers. In our Feed product, we have achieved a 12.1% average latency improvement and 9.8% CPU resource savings.
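The dynamic NUMA node binding performed by the online module can be illustrated with a minimal sketch. This is not MAO's actual implementation; the helper names (`parse_cpulist`, `bind_to_node`) are hypothetical, and it simply pins a process to one node's CPUs via standard Linux interfaces so that, under the kernel's default first-touch policy, subsequent allocations land on the local node.

```python
import os

def parse_cpulist(spec: str) -> set[int]:
    """Parse a Linux sysfs cpulist string such as '0-3,8' into a set of CPU ids."""
    cpus: set[int] = set()
    for part in spec.split(","):
        if "-" in part:
            lo, hi = map(int, part.split("-"))
            cpus.update(range(lo, hi + 1))
        else:
            cpus.add(int(part))
    return cpus

def bind_to_node(pid: int, node: int) -> None:
    """Pin a process to the CPUs of one NUMA node (Linux only, illustrative).

    With the kernel's default first-touch policy, pages the process allocates
    after this call are placed on the node it runs on, improving locality.
    """
    with open(f"/sys/devices/system/node/node{node}/cpulist") as f:
        cpus = parse_cpulist(f.read().strip())
    os.sched_setaffinity(pid, cpus)
```

A production system such as MAO would additionally migrate already-allocated pages and rebalance bindings at runtime; this sketch covers only the initial CPU-affinity step.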