Employing SmartNICs' Data Path Accelerators for Ordered Key-Value Stores

Remote in-memory key-value (KV) stores are a cornerstone of many modern workloads, which frequently also require fast range scans. However, current architectures rarely combine high performance, architectural simplicity, and native support for ordered operations. Conventional host-centric designs are limited by kernel-space network stacks and internal bus latencies. Hash-based alternatives that use OS bypass or run natively on SmartNICs offer high throughput, but they lack the data structures needed for range queries. Distributed RDMA-based systems provide both performance and range functionality but often depend on stateful clients, which complicates scaling and error handling. SmartNIC implementations that traverse trees located in host memory, in turn, are hampered by high DMA round-trip latencies. This paper introduces a KV store that uses the on-path Data Path Accelerators (DPAs) of the BlueField-3 SmartNIC to eliminate operating system overhead while supporting stateless clients and range operations. The DPAs ingest network requests directly from NIC buffers and traverse a lock-free learned index residing in the accelerator's local memory. Values are fetched from the host-side tree replica only once the traversal reaches the leaf level, which minimizes PCIe crossings. Write operations are staged in DPA memory and migrated in batches to the host, where structural maintenance is performed before the updated structure is transactionally stitched back into the SmartNIC. Coupled with a NIC-resident read cache, the system achieves 33 million operations per second (MOPS) for point lookups and 13 MOPS for range queries. Our analysis shows that this architecture matches or exceeds the performance of contemporary state-of-the-art solutions, and we identify hardware refinements that could improve performance further.
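
The read and write paths described above can be illustrated with a small, host-only Python sketch. It is not the paper's implementation: it assumes a single linear model over integer keys (a real learned index would typically use many segments), keeps values next to the keys instead of fetching them across PCIe, and stands in for the batched host migration with a full rebuild followed by a reference swap. The names `LearnedIndex`, `OrderedKVStore`, and `batch_size` are hypothetical.

```python
import bisect


class LearnedIndex:
    """Toy read-only learned index: one linear model plus a bounded local search."""

    def __init__(self, items):
        pairs = sorted(dict(items).items())  # deduplicate, then order by key
        self.keys = [k for k, _ in pairs]
        self.values = [v for _, v in pairs]
        n = len(self.keys)
        self.base = self.keys[0] if n else 0
        span = self.keys[-1] - self.keys[0] if n > 1 else 0
        self.slope = (n - 1) / span if span else 0.0
        # The worst prediction error over the stored keys bounds the search window.
        self.err = max(
            (abs(self._predict(k) - i) for i, k in enumerate(self.keys)), default=0
        )

    def _predict(self, key):
        pos = int(self.slope * (key - self.base))
        return min(max(pos, 0), max(len(self.keys) - 1, 0))

    def lower_bound(self, key):
        """Index of the first stored key >= key, searching only near the prediction."""
        pos = self._predict(key)
        lo = max(0, pos - self.err)
        hi = min(len(self.keys), pos + self.err + 1)
        return bisect.bisect_left(self.keys, key, lo, hi)


class OrderedKVStore:
    """Reads go through the index; writes are staged and merged in batches."""

    def __init__(self, items=(), batch_size=4):
        self.index = LearnedIndex(items)
        self.staged = {}  # assumption: newest staged value per key wins
        self.batch_size = batch_size

    def put(self, key, value):
        self.staged[key] = value
        if len(self.staged) >= self.batch_size:
            self.flush()

    def flush(self):
        # Stand-in for "migrate the batch, restructure, stitch back":
        # build a new index off to the side, then swap one reference.
        merged = dict(zip(self.index.keys, self.index.values))
        merged.update(self.staged)
        self.index = LearnedIndex(merged.items())
        self.staged = {}

    def get(self, key):
        if key in self.staged:
            return self.staged[key]
        idx = self.index
        i = idx.lower_bound(key)
        if i < len(idx.keys) and idx.keys[i] == key:
            return idx.values[i]
        return None

    def scan(self, start, count):
        """Up to `count` (key, value) pairs with key >= start, in key order."""
        idx = self.index
        i = idx.lower_bound(start)
        window = dict(zip(idx.keys[i:i + count], idx.values[i:i + count]))
        window.update({k: v for k, v in self.staged.items() if k >= start})
        return sorted(window.items())[:count]


store = OrderedKVStore([(k, f"v{k}") for k in (10, 20, 30, 45, 70, 100)], batch_size=2)
store.put(25, "v25")       # staged only; the index is untouched
print(store.get(25))       # v25
print(store.scan(20, 3))   # [(20, 'v20'), (25, 'v25'), (30, 'v30')]
store.put(20, "new")       # second staged write triggers a batch flush
print(store.get(20))       # new
```

In this sketch the search window is derived from the model's maximum error, so a lookup touches only a few positions around the prediction. Readers always see either the old or the new index because `flush` replaces it with a single assignment, which loosely mirrors the idea of stitching restructured state back in one step.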
