TPP: Transparent Page Placement for CXL-Enabled Tiered-Memory

Hasan Al Maruf, Hao Wang, Abhishek Dhanotia, Johannes Weiner, Niket Agarwal, Pallab Bhattacharya, Chris Petersen, Mosharaf Chowdhury, Shobhit Kanaujia, Prakash Chauhan

The increasing demand for memory in hyperscale applications has made memory a large portion of overall datacenter spend. The emergence of coherent interfaces like CXL enables main memory expansion and offers an efficient solution to this problem. In such systems, main memory can consist of different memory technologies with varied characteristics. In this paper, we characterize the memory usage patterns of a wide range of datacenter applications across Meta's server fleet and thereby demonstrate the opportunity to offload colder pages to slower memory tiers for these applications. Without efficient memory management, however, such systems can significantly degrade performance. We propose TPP, a novel OS-level, application-transparent page placement mechanism for CXL-enabled memory. TPP employs a lightweight mechanism to identify hot and cold pages and place them in the appropriate memory tiers. It enables proactive page demotion from local memory to CXL-Memory. This technique ensures memory headroom for new page allocations, which are often related to request processing and tend to be short-lived and hot. At the same time, TPP can promptly promote performance-critical hot pages trapped in the slow CXL-Memory to the fast local memory, while minimizing both sampling overhead and unnecessary migrations. TPP works transparently without any application-specific knowledge and can be deployed globally as a kernel release. We evaluate TPP in the production server fleet with early samples of new x86 CPUs with CXL 1.1 support. With TPP, a tiered memory system performs within 1% of an ideal baseline that has all of its memory in the local tier. It is 18% better than today's Linux, and 5-17% better than existing solutions including NUMA Balancing and AutoTiering. Most of the TPP patches have been merged in the Linux v5.18 release.
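The interplay between proactive demotion and promotion described above can be illustrated with a toy model. The sketch below is not the kernel implementation: the class `TwoTierMemory`, the `headroom` and `promote_after` parameters, and the rule of promoting a page after a fixed number of accesses in the slow tier are all hypothetical simplifications chosen for illustration. It only shows the general idea that new pages land in the local tier, the coldest local pages are demoted to keep free headroom, and repeatedly accessed pages in the slow tier are promoted back.

```python
from collections import OrderedDict


class TwoTierMemory:
    """Toy model of a fast local tier backed by a slower CXL tier (illustrative only)."""

    def __init__(self, local_capacity, headroom, promote_after=2):
        self.local_capacity = local_capacity
        self.headroom = headroom            # free local pages to keep available (assumed knob)
        self.promote_after = promote_after  # CXL accesses before promotion (assumed knob)
        self.local = OrderedDict()          # page -> None, ordered coldest to hottest
        self.cxl = {}                       # page -> accesses seen while in the CXL tier
        self.demotions = 0
        self.promotions = 0

    def _demote_until_headroom(self):
        # Proactive demotion: move the coldest local pages to CXL until
        # the free-page headroom is restored.
        while self.local and self.local_capacity - len(self.local) < self.headroom:
            page, _ = self.local.popitem(last=False)
            self.cxl[page] = 0
            self.demotions += 1

    def allocate(self, page):
        # New pages land in the local tier, since they tend to be short-lived and hot.
        self.local[page] = None
        self._demote_until_headroom()

    def access(self, page):
        # Returns the tier that served the access.
        if page in self.local:
            self.local.move_to_end(page)    # mark as most recently used
            return "local"
        self.cxl[page] += 1
        if self.cxl[page] >= self.promote_after:
            # Promote a repeatedly accessed page out of the slow tier.
            del self.cxl[page]
            self.local[page] = None
            self.promotions += 1
            self._demote_until_headroom()
        return "cxl"


mem = TwoTierMemory(local_capacity=4, headroom=1)
for name in "abcd":
    mem.allocate(name)
first = mem.access("a")    # "a" was demoted to make headroom; served from CXL
second = mem.access("a")   # second CXL access triggers promotion
third = mem.access("a")    # now served from the local tier
```

Requiring more than one access before promotion is one simple way to avoid migrating pages that are touched only once; the threshold used here is an assumption, not a value taken from the paper.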
