Taking the Leap: Efficient and Reliable Fine-Grained NUMA Migration in User-space

Modern multi-socket architectures offer a single virtual address space but physically divide main memory across multiple regions, each attached to a CPU and its cores. While this simplifies usage, developers must be aware of non-uniform memory access (NUMA): an access by a thread to memory in the NUMA region local to its core is significantly cheaper than an access to memory in a remote region. Consequently, if query answering is parallelized across the cores of multiple regions, the portion of the database on which the query operates should be distributed across the same regions to ensure local accesses. Because the present data placement might not match this distribution, pages can be migrated from one NUMA region to another to improve the situation. Two options exist for doing so. One is to rely on the automatic NUMA balancing integrated in Linux, which is steered by the observed access patterns and frequency. The other is to actively trigger migration via the system call move_pages(). Unfortunately, both variants have significant downsides in terms of feature set and performance. As an alternative, we propose a new user-space migration method called page_leap() that performs page migration asynchronously and at high performance by exploiting features of the virtual memory subsystem. The method (a) is actively triggered by the user, (b) ensures that all pages are eventually migrated, (c) handles concurrent writes correctly, (d) supports pooled memory, (e) adaptively adjusts its migration granularity based on the workload, and (f) supports both small pages and huge pages.
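
For orientation, the move_pages() baseline mentioned above can be sketched in C as follows. This is a minimal sketch of the existing system-call route only, not of page_leap(), whose implementation the abstract does not describe. The helper names raw_move_pages, collect_pages, query_nodes, and migrate_to_node are hypothetical and introduced here for illustration. The sketch assumes Linux, a page-aligned buffer, and a valid target node number; it invokes the system call directly so that no libnuma wrapper is required.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Value from <linux/mempolicy.h>: move only pages used exclusively by
 * the calling process. Defined here in case the header is not included. */
#ifndef MPOL_MF_MOVE
#define MPOL_MF_MOVE (1 << 1)
#endif

/* Hypothetical thin wrapper around the raw move_pages system call. */
long raw_move_pages(int pid, unsigned long count, void **pages,
                    const int *nodes, int *status, int flags)
{
    return syscall(SYS_move_pages, pid, count, pages, nodes, status, flags);
}

/* Fill pages[] with the start address of every page in [base, base + len).
 * Assumes base is page-aligned and pages[] has room for the result.
 * Returns the number of pages covered. */
size_t collect_pages(char *base, size_t len, size_t page_size, void **pages)
{
    size_t count = (len + page_size - 1) / page_size;
    for (size_t i = 0; i < count; i++)
        pages[i] = base + i * page_size;
    return count;
}

/* Query only: with nodes == NULL nothing is moved, and status[i] receives
 * the node currently holding page i (or a negative errno value). */
long query_nodes(void **pages, size_t count, int *status)
{
    return raw_move_pages(0, count, pages, NULL, status, 0);
}

/* Ask the kernel to move all listed pages of the calling process (pid 0)
 * to the NUMA node `target`. Returns 0 on success, -1 on error, and on
 * newer kernels a positive count of pages that could not be migrated. */
long migrate_to_node(void **pages, size_t count, int target, int *status)
{
    assert(count > 0);
    int *nodes = malloc(count * sizeof *nodes);
    if (nodes == NULL)
        return -1;
    for (size_t i = 0; i < count; i++)
        nodes[i] = target;
    long rc = raw_move_pages(0, count, pages, nodes, status, MPOL_MF_MOVE);
    free(nodes);
    return rc;
}
```

A caller would typically allocate a page-aligned buffer, touch it so that the pages are actually backed by physical memory, call collect_pages() to build the address array, and then call migrate_to_node(). Note that this call is synchronous and reports failures per page in status[]; the caller has to inspect that array and retry if every page must end up on the target node.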
