TurboMem: High-Performance Lock-Free Memory Pool with Transparent Huge Page Auto-Merging for DPDK

High-speed packet processing on multicore CPUs places extreme demands on memory allocators. In systems like DPDK, fixed-size memory pools back packet buffers (mbufs) to avoid costly dynamic allocation. However, even DPDK's optimized mempool faces scalability limits: lock contention on the shared ring, cache-coherence ping-pong between cores, and heavy TLB pressure from thousands of small pages. To mitigate these issues, DPDK typically uses explicit huge pages (2 MB or 1 GB) for its memory pools. This reduces TLB misses but requires manual configuration and can lead to fragmentation and inflexibility. We propose TurboMem, a novel C++ template-based memory pool that addresses these challenges. TurboMem combines a fully lock-free design (using atomic stacks and per-core local caches) with Transparent Huge Page (THP) auto merging. By automatically promoting pools to 2 MB pages via madvise(MADV_HUGEPAGE), TurboMem achieves the benefits of huge pages without manual setup. We also enforce strict NUMA locality and CPU affinity, so each core allocates and frees objects from its local node. Using Intel VTune on a single-socket 100 Gbps testbed, we show that TurboMem boosts packet throughput by up to 28% while reducing TLB misses by 41% compared to a standard DPDK mempool with explicit huge pages. These results demonstrate that THP auto-merging can outperform manually reserved huge pages in low-fragmentation scenarios, and that modern C++ lock-free programming yields practical gains in data-plane software. Note: The performance claims reported in this preliminary version (up to 28% higher throughput and 41% fewer TLB misses) are based on mock benchmarks. Comprehensive real-system evaluations using Intel VTune are currently underway and will be presented in a future revision.

翻译：在多核CPU上进行高速数据包处理对内存分配器提出了极高要求。在DPDK等系统中，固定大小的内存池为数据包缓冲区（mbuf）提供支持，以避免昂贵的动态分配。然而，即便是DPDK优化后的内存池也存在可扩展性瓶颈：共享环上的锁竞争、核间缓存一致性乒乓效应以及来自数千个小页面的TLB压力。为缓解这些问题，DPDK通常对其内存池使用显式大页（2 MB或1 GB）。这减少了TLB缺失，但需要手动配置，并可能导致碎片化和缺乏灵活性。我们提出TurboMem，一种新颖的基于C++模板的内存池，以应对这些挑战。TurboMem将完全无锁的设计（使用原子栈和每核本地缓存）与透明大页（THP）自动合并相结合。通过madvise(MADV_HUGEPAGE)自动将内存池提升为2 MB页面，TurboMem无需手动设置即可实现大页的优势。我们还强制实施严格的NUMA局部性和CPU亲和性，使每个核从其本地节点分配和释放对象。使用Intel VTune在单插槽100 Gbps测试平台上，我们表明与使用显式大页的标准DPDK内存池相比，TurboMem可将数据包吞吐量提升高达28%，同时TLB缺失减少41%。这些结果表明，在低碎片化场景下，THP自动合并可优于手动预留的大页，并且现代C++无锁编程在数据平面软件中带来了实际收益。注：本初版中报告的性能声明（吞吐量提升高达28%，TLB缺失减少41%）基于模拟基准测试。使用Intel VTune进行的全面真实系统评估正在进行中，将在未来修订版中呈现。