numaPTE: Managing Page-Tables and TLBs on NUMA Systems

Memory-management operations that modify page-tables, typically performed during memory allocation and deallocation, are notorious for their poor performance in highly threaded applications. The main cause is the process-wide TLB shootdowns that the OS must issue because hardware provides no support for TLB coherence. We study these operations in NUMA settings, where we observe up to 40x overhead for basic operations such as munmap or mprotect. The overhead increases further when page-table replication is used, in which complete, coherent copies of the page-tables are maintained across all NUMA nodes. While eager system-wide replication is extremely effective at localizing page-table reads during address translation, we find that it adds a penalty to every page-table change, because all replicas must be kept coherent. In this paper, we propose a novel page-table management mechanism, called numaPTE, that enables transparent, on-demand, and partial page-table replication across NUMA nodes. It performs address translation locally while avoiding the overheads and scalability issues of system-wide full page-table replication. We then show that numaPTE's precise knowledge of page-table sharers can be leveraged to significantly reduce the number of TLB shootdowns issued on memory-management operations. As a result, numaPTE not only avoids replication-related slowdowns, but also provides significant speedup over the baseline on memory allocation/deallocation and access-control operations. We implement numaPTE in Linux on x86_64 and evaluate it on 4- and 8-socket systems. We show that numaPTE achieves the full benefits of eager page-table replication on a wide range of applications, while also achieving runtime improvements of 12% on Webserver and 36% on Memcached due to a significant reduction in TLB shootdowns.
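
To make the two ideas in the abstract concrete, the following is a minimal toy model, not the numaPTE implementation. It assumes that a page-table page is replicated on a node the first time a CPU on that node walks it, and that only CPUs on nodes holding a replica can have cached the corresponding translations, so a shootdown can be limited to those nodes. All names here (`PageTablePage`, `ToyAddressSpace`, `translate`, `shootdown_targets`) are hypothetical and chosen for illustration only; the baseline is simplified to "interrupt every CPU", which assumes the process has threads on every node.

```python
from dataclasses import dataclass, field


@dataclass
class PageTablePage:
    """One page-table page plus the set of NUMA nodes holding a replica of it."""
    home_node: int
    sharers: set[int] = field(default_factory=set)

    def __post_init__(self) -> None:
        # The node that created the mapping always holds a copy.
        self.sharers.add(self.home_node)


class ToyAddressSpace:
    """Hypothetical model: regions are integer keys, CPUs are numbered node by node."""

    def __init__(self, num_nodes: int, cpus_per_node: int) -> None:
        self.num_nodes = num_nodes
        self.cpus_per_node = cpus_per_node
        self.tables: dict[int, PageTablePage] = {}

    def map_region(self, region: int, node: int) -> None:
        self.tables[region] = PageTablePage(home_node=node)

    def translate(self, region: int, node: int) -> bool:
        """Simulate a page walk from `node`; return True if a replica was created."""
        table = self.tables[region]
        if node in table.sharers:
            return False  # local replica already exists
        table.sharers.add(node)  # on-demand, partial replication
        return True

    def shootdown_targets(self, region: int, precise: bool) -> list[int]:
        """CPUs to interrupt when `region`'s page-table entries change."""
        if precise:
            nodes = sorted(self.tables[region].sharers)
        else:
            nodes = list(range(self.num_nodes))  # simplified process-wide baseline
        return [n * self.cpus_per_node + c
                for n in nodes
                for c in range(self.cpus_per_node)]


if __name__ == "__main__":
    space = ToyAddressSpace(num_nodes=4, cpus_per_node=2)
    space.map_region(region=0, node=0)
    space.translate(region=0, node=2)  # node 2 touches the region
    print(space.shootdown_targets(0, precise=True))   # CPUs on nodes 0 and 2 only
    print(space.shootdown_targets(0, precise=False))  # every CPU in the system
```

In this sketch, a change to region 0 interrupts only the CPUs on the two sharer nodes instead of all eight CPUs, which is the kind of reduction the abstract attributes to tracking page-table sharers.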
