Storage disaggregation, wherein storage is accessed over the network, is popular because it allows applications to independently scale storage capacity and bandwidth based on dynamic application demand. However, the added network processing introduced by disaggregation can consume significant CPU resources. In many storage systems, logical storage operations (e.g., lookups, aggregations) involve a series of simple but dependent I/O accesses. Therefore, one way to reduce the network processing overhead is to execute each dependent series of I/O accesses at the remote storage server, reducing the back-and-forth communication between the storage layer and the application. We refer to this approach as \emph{remote-storage pushdown}. We present BPF-oF, a new remote-storage pushdown protocol built on top of NVMe-oF, which enables applications to safely push custom eBPF storage functions to a remote storage server. The main challenge in integrating BPF-oF with storage systems is preserving the benefits of their client-based in-memory caches. We address this challenge by designing novel caching techniques for storage pushdown, including splitting queries into separate in-memory and remote-storage phases and periodically refreshing the client cache with sampled accesses from the remote storage device. We demonstrate the utility of BPF-oF by integrating it with three storage systems, including RocksDB, a popular persistent key-value store that has no existing storage pushdown capability. We show that BPF-oF provides significant speedups in all three systems when accessed over the network, for example improving RocksDB's throughput by up to 2.8$\times$ and its tail latency by up to 2.6$\times$.
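To make the round-trip argument concrete, here is a minimal toy model in Python of a dependent lookup over index and leaf blocks held by a remote store. It is only a sketch: the names `RemoteStore`, `pushdown`, `make_step`, and `client_driven_lookup`, as well as the block layout, are hypothetical and do not reflect the BPF-oF interface, which pushes eBPF functions over NVMe-oF. The model only counts network round trips for the two strategies.

```python
# Toy model of remote-storage pushdown. All names are hypothetical
# illustrations; this is NOT the BPF-oF or NVMe-oF API.

class RemoteStore:
    """Holds blocks and counts network round trips made by the client."""

    def __init__(self, blocks):
        self.blocks = blocks
        self.round_trips = 0

    def read(self, block_id):
        # Client-driven access: every block read costs one round trip.
        self.round_trips += 1
        return self.blocks[block_id]

    def pushdown(self, start_block, step):
        # Pushdown: the whole dependent chain runs "at the server",
        # so the client pays for a single round trip.
        self.round_trips += 1
        block_id = start_block
        while True:
            done, value = step(self.blocks[block_id])
            if done:
                return value
            block_id = value


def make_step(key):
    """Build the per-block function that decides the next I/O or the result."""

    def step(block):
        if block["kind"] == "leaf":
            return True, block["entries"].get(key)
        # Index block: follow the first child whose upper bound covers the key.
        for upper, child in block["children"]:
            if key <= upper:
                return False, child
        return True, None  # key is beyond every child's range

    return step


def client_driven_lookup(store, root, key):
    """Baseline: the client issues each dependent read itself."""
    step = make_step(key)
    block_id = root
    while True:
        done, value = step(store.read(block_id))
        if done:
            return value
        block_id = value


BLOCKS = {
    0: {"kind": "index", "children": [("m", 1), ("z", 2)]},
    1: {"kind": "index", "children": [("f", 3), ("m", 4)]},
    2: {"kind": "leaf", "entries": {"pear": 4}},
    3: {"kind": "leaf", "entries": {"apple": 1, "cat": 2}},
    4: {"kind": "leaf", "entries": {"kiwi": 3}},
}

baseline = RemoteStore(BLOCKS)
baseline_value = client_driven_lookup(baseline, 0, "kiwi")

pushed = RemoteStore(BLOCKS)
pushed_value = pushed.pushdown(0, make_step("kiwi"))
```

In this toy tree both strategies return the same value, but the client-driven lookup pays one round trip per level of the index while the pushed-down lookup pays one in total. The model ignores client-side caching, which the abstract identifies as the main integration challenge.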