The development of high-speed storage devices such as NVMe SSDs has shifted the primary I/O bottleneck from hardware to software. Modern database systems also rely on kernel-based I/O paths, where frequent system calls and user-kernel transitions add substantial overhead and degrade performance. This issue is particularly pronounced in Log-Structured Merge-tree (LSM-tree)-based NoSQL databases. In particular, we identified that the background compaction process issues a large number of read system calls, causing significant overhead. To address this problem, we propose RESYSTANCE, which leverages eBPF and io_uring to free compaction from system calls and unlock hidden performance potential. RESYSTANCE improves disk I/O efficiency during read operations via io_uring and significantly reduces software stack overhead by handling compaction directly inside the kernel through eBPF. Moreover, RESYSTANCE minimizes user-kernel transitions by offloading key I/O routines into the kernel without modifying the LSM-tree structure or compaction algorithm. We evaluated RESYSTANCE extensively using db_bench, YCSB, and OLTP workloads. Compared to baseline RocksDB, it reduced the average number of system call invocations during compaction by 99% and shortened compaction time by 50%. Consequently, in write-intensive workloads, RESYSTANCE improved throughput by up to 75% and reduced p99 latency by 40%.
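To make the system call argument concrete, the following is a toy counting model, not the RESYSTANCE mechanism itself. It assumes a compaction input read as fixed-size blocks, and compares one read system call per block against batched submission, where a single submission call covers up to `batch_size` reads (the general idea behind queue-based interfaces such as io_uring). The function names, block counts, and batch size are hypothetical and are not taken from the paper.

```python
import math


def syscalls_per_block(num_blocks: int) -> int:
    """Model: one read system call (e.g. pread) is issued per block."""
    if num_blocks < 0:
        raise ValueError("num_blocks must be non-negative")
    return num_blocks


def syscalls_batched(num_blocks: int, batch_size: int) -> int:
    """Model: one submission system call covers up to batch_size reads."""
    if num_blocks < 0:
        raise ValueError("num_blocks must be non-negative")
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return math.ceil(num_blocks / batch_size)


def reduction_percent(before: int, after: int) -> float:
    """Relative reduction in system call count, in percent."""
    if before <= 0:
        raise ValueError("before must be positive")
    return 100.0 * (before - after) / before


if __name__ == "__main__":
    # Hypothetical input: 64 MiB of SSTable data read in 4 KiB blocks.
    blocks = (64 * 1024 * 1024) // (4 * 1024)
    naive = syscalls_per_block(blocks)
    batched = syscalls_batched(blocks, batch_size=64)
    print(naive, batched, reduction_percent(naive, batched))
```

The model only counts calls. It ignores per-call cost, completion handling, and the in-kernel processing through eBPF that the abstract describes, so it should not be read as a reproduction of the reported results.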