Log-Structured Merge-Trees (LSM-trees) dominate persistent key-value storage but suffer from high write amplification, from 10x to 30x under random workloads, because of repeated compaction. This overhead becomes prohibitive for large values with uniformly distributed keys, a workload common in content-addressable storage, deduplication systems, and blockchain validators. We present Tidehunter, a storage engine that eliminates value compaction by treating the Write-Ahead Log (WAL) as permanent storage rather than a temporary recovery buffer. Values are never overwritten, and small, lazily flushed index tables map keys to WAL positions. Tidehunter introduces (a) lock-free writes that saturate NVMe drives through atomic allocation and parallel copying, (b) an optimistic index structure that exploits uniform key distributions for single-roundtrip lookups, and (c) epoch-based pruning that reclaims space without blocking writes. On a 1 TB dataset with 1 KB values, Tidehunter achieves 830K writes per second, which is 8.4x higher than RocksDB and 2.9x higher than BlobDB, while improving point queries by 1.7x and existence checks by 15.6x. We validate real-world impact by integrating Tidehunter into Sui, a high-throughput blockchain, where it maintains stable throughput and latency under loads that cause RocksDB-backed validators to collapse. Tidehunter is production-ready and is being deployed within Sui.
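To make the central idea concrete, the following is a minimal in-memory sketch of a store whose append-only log is the permanent home of every value, with a small index mapping each key to a position in that log. It is not Tidehunter's implementation: the class name `LogStore`, the record layout, the in-memory `bytearray` standing in for the on-disk WAL, and the `rebuild_index` replay step are all assumptions made for illustration, and the sketch omits the lock-free allocation, the optimistic index, and epoch-based pruning described above.

```python
import struct


class LogStore:
    """Toy store (hypothetical): values live only in an append-only log."""

    # Assumed record layout: key length, value length, then key, then value.
    HEADER = struct.Struct("<II")

    def __init__(self):
        self.log = bytearray()  # stands in for the on-disk WAL
        self.index = {}         # key -> (value offset, value length)

    def put(self, key: bytes, value: bytes) -> int:
        """Append a record and point the index at the new value."""
        record_offset = len(self.log)
        self.log += self.HEADER.pack(len(key), len(value))
        self.log += key
        self.log += value
        value_offset = record_offset + self.HEADER.size + len(key)
        # An update only redirects the index; the old bytes stay in the log.
        self.index[key] = (value_offset, len(value))
        return record_offset

    def get(self, key: bytes):
        """Look up the position in the index, then read once from the log."""
        entry = self.index.get(key)
        if entry is None:
            return None
        offset, length = entry
        return bytes(self.log[offset:offset + length])

    def contains(self, key: bytes) -> bool:
        """Existence check that never touches the value bytes."""
        return key in self.index

    def rebuild_index(self) -> dict:
        """Replay the log to recover the index (an assumed recovery path)."""
        index = {}
        pos = 0
        while pos < len(self.log):
            key_len, value_len = self.HEADER.unpack_from(self.log, pos)
            key_start = pos + self.HEADER.size
            key = bytes(self.log[key_start:key_start + key_len])
            index[key] = (key_start + key_len, value_len)
            pos = key_start + key_len + value_len
        return index
```

In this sketch, writing the same key twice appends a second record and leaves the first in place, so no value is ever rewritten or moved; only the index entry changes. Because the index can be rebuilt by replaying the log, it could in principle be flushed lazily, which is one plausible reading of why the index tables can stay small and cheap to maintain.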