The log-structured merge tree (LSM-tree) is widely employed to build key-value (KV) stores. An LSM-tree organizes data into multiple levels in memory and on disk. Compaction, which moves KV pairs between on-disk levels in the form of SST files, severely stalls foreground service. We examine and analyze the compaction procedure. Writing files for compacted KV pairs and persisting them with fsyncs is time-consuming and, more importantly, occurs synchronously on the critical path of compaction. The user-space compaction thread waits for completion signals from a kernel-space thread that processes the file write and fsync I/Os. We accordingly design a new LSM-tree variant named AisLSM with an asynchronous I/O model. In short, AisLSM conducts asynchronous writes and fsyncs for the SST files generated in a compaction and overlaps CPU computations with disk I/Os across consecutive compactions. AisLSM tracks the generation dependency between the input and output files of each compaction and uses a deferred check-up strategy to ensure the durability of compacted KV pairs. We prototype AisLSM with RocksDB and io_uring. Experiments show that AisLSM boosts the performance of RocksDB by up to 2.14x without losing data accessibility or consistency. It also outperforms state-of-the-art LSM-tree variants, with significantly higher throughput and lower tail latency.
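The idea of asynchronous persistence with a deferred check-up can be illustrated with a minimal sketch. It is not the AisLSM implementation: it swaps io_uring for a Python thread pool, uses plain text lines in place of SST contents, and the names `DeferredCompactor`, `compact`, and `check_up` are hypothetical. It assumes only what the abstract states: the write and fsync of an output file are submitted without blocking the compaction thread, the dependency between input and output files is recorded, and input files are discarded only once the outputs are confirmed durable. The point at which the check runs is likewise an assumption of this sketch.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def write_and_fsync(path, data):
    """Write data and force it to stable storage; runs off the compaction thread."""
    with open(path, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())
    return path


class DeferredCompactor:
    """Toy model: input files stay on disk until their outputs are durable."""

    def __init__(self, pool):
        self.pool = pool
        # Each entry pairs the input files with the future of the output
        # file generated from them (the "generation dependency").
        self.pending = []

    def compact(self, inputs, out_path):
        # CPU part: merge the inputs' lines (a stand-in for merging KV pairs).
        merged = set()
        for p in inputs:
            with open(p, "rb") as f:
                merged.update(f.read().splitlines())
        data = b"\n".join(sorted(merged))
        # I/O part: submit write + fsync and return without waiting, so the
        # next compaction's CPU work can overlap with this disk I/O.
        future = self.pool.submit(write_and_fsync, out_path, data)
        self.pending.append((inputs, future))

    def check_up(self):
        # Deferred check: wait only now, then delete inputs whose output
        # is confirmed durable. result() re-raises any write/fsync error,
        # in which case the inputs are kept.
        for inputs, future in self.pending:
            future.result()
            for p in inputs:
                os.remove(p)
        self.pending.clear()
```

In this sketch a caller would invoke `compact` for several jobs in a row and call `check_up` later; until then the old input files remain available, so a crash before the outputs are durable would not lose the compacted KV pairs.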