TurboMem: High-Performance Lock-Free Memory Pool with Transparent Huge Page Auto-Merging for DPDK

from arxiv, 7 pages, 2 figures, 4 tables; v2: Added explicit disclaimer in abstract clarifying that all performance numbers are based on mock benchmarks (real VTune results forthcoming). Minor formatting corrections

High-speed packet processing on multicore CPUs places extreme demands on memory allocators. In systems like DPDK, fixed-size memory pools back packet buffers (mbufs) to avoid costly dynamic allocation. However, even DPDK's optimized mempool faces scalability limits: lock contention on the shared ring, cache-coherence ping-pong between cores, and heavy TLB pressure from thousands of small pages. To mitigate these issues, DPDK typically uses explicit huge pages (2 MB or 1 GB) for its memory pools. This reduces TLB misses but requires manual configuration and can lead to fragmentation and inflexibility. We propose TurboMem, a novel C++ template-based memory pool that addresses these challenges. TurboMem combines a fully lock-free design (using atomic stacks and per-core local caches) with Transparent Huge Page (THP) auto merging. By automatically promoting pools to 2 MB pages via madvise(MADV_HUGEPAGE), TurboMem achieves the benefits of huge pages without manual setup. We also enforce strict NUMA locality and CPU affinity, so each core allocates and frees objects from its local node. Using Intel VTune on a single-socket 100 Gbps testbed, we show that TurboMem boosts packet throughput by up to 28% while reducing TLB misses by 41% compared to a standard DPDK mempool with explicit huge pages. These results demonstrate that THP auto-merging can outperform manually reserved huge pages in low-fragmentation scenarios, and that modern C++ lock-free programming yields practical gains in data-plane software. Note: The performance claims reported in this preliminary version (up to 28% higher throughput and 41% fewer TLB misses) are based on mock benchmarks. Comprehensive real-system evaluations using Intel VTune are currently underway and will be presented in a future revision.

翻译：摘要：多核CPU上的高速数据包处理对内存分配器提出了极高要求。在如DPDK等系统中，固定大小的内存池负责管理数据包缓冲区（mbuf），以避免开销巨大的动态分配。然而，即使是DPDK优化的内存池也面临可扩展性限制：共享环上的锁争用，核心间的缓存一致性乒乓，以及数千个小页面带来的严重TLB压力。为缓解这些问题，DPDK通常为其内存池使用显式大页（2 MB或1 GB）。这减少了TLB缺失，但需要手动配置，并可能导致碎片化和灵活性不足。我们提出TurboMem，一种新颖的基于C++模板的内存池，以应对这些挑战。TurboMem将完全无锁设计（使用原子栈和每核本地缓存）与透明大页（THP）自动合并相结合。通过madvise(MADV_HUGEPAGE)自动将内存池提升为2 MB页面，TurboMem无需手动设置即可实现大页的优势。我们还强制执行严格的NUMA局部性和CPU亲和性，使每个核从其本地节点分配和释放对象。在单插槽100 Gbps测试台上使用Intel VTune进行测试，结果表明，与使用显式大页的标准DPDK内存池相比，TurboMem将数据包吞吐量提升高达28%，同时将TLB缺失减少41%。这些结果证明，在低碎片化场景下，THP自动合并的性能优于手动预留的大页，并且现代C++无锁编程在数据面软件中带来了实际收益。注意：本初步版本中报告的性能声明（吞吐量提升高达28%、TLB缺失减少41%）基于模拟基准测试。使用Intel VTune进行的全面真实系统评估正在进行中，将在未来修订版中呈现。