A Pragmatic Approach to Learned Indexing in RocksDB: Targeted Optimizations with Minimal System Modification

Learned indexes have emerged as a promising alternative to traditional index structures: by approximating the cumulative distribution function of the keys with lightweight models, they offer higher throughput and lower memory usage. Despite these benefits, adoption in production systems remains limited, partly because no learned index yet supports concurrency and persistence as effectively as established structures such as the B+-Tree, and partly because many research prototypes introduce substantial complexity. In this paper, we investigate whether off-the-shelf learned indexes can be integrated into a production database with minimal storage-engine redesign. Using RocksDB as a case study, we exploit its separation between in-memory Memtables and immutable on-disk files to deploy specialized indexes at each level. We show that directly applying existing learned indexes is insufficient under write-heavy workloads because frequent Memtable replacement prevents models from fully adapting to the data distribution. To address this, we introduce a reuse mechanism that preserves structural knowledge across Memtable instances. At the storage level, we replace RocksDB's disk index with a learned index without modifying the storage layer or read path. We further adapt a read-only learned index to be block-aware, so that a lookup requires at most one I/O in the worst case. We implement these techniques in MountDB, an extension of RocksDB. Experiments on large-scale workloads with diverse data distributions and access patterns show up to 1.5× higher write throughput and 2.1× higher read throughput than state-of-the-art systems, demonstrating that established learned indexes can be integrated into production systems with minimal overhead and substantial performance benefits.
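To make the block-aware idea concrete, the following is a minimal Python sketch, not MountDB's implementation. It assumes distinct, sorted numeric keys stored in fixed-size blocks, and it uses a simple greedy piecewise-linear segmentation chosen here purely for illustration: a segment is extended only while its model predicts the correct block for every key it covers, so a lookup reads exactly one block. The names `fit_line`, `BlockAwareIndex`, and `blocks_read` are hypothetical, and the in-memory list stands in for an on-disk file.

```python
import bisect


def fit_line(keys, positions):
    """Least-squares line mapping key -> position (a linear approximation of the key CDF)."""
    n = len(keys)
    mean_k = sum(keys) / n
    mean_p = sum(positions) / n
    var = sum((k - mean_k) ** 2 for k in keys)
    if var == 0:  # a single key: fall back to a constant model
        return 0.0, mean_p
    slope = sum((k - mean_k) * (p - mean_p) for k, p in zip(keys, positions)) / var
    return slope, mean_p - slope * mean_k


class BlockAwareIndex:
    """Illustrative piecewise-linear index whose models always predict the right block."""

    def __init__(self, sorted_keys, block_size):
        self.keys = sorted_keys          # stands in for the on-disk, block-organized data
        self.block_size = block_size
        self.num_blocks = (len(sorted_keys) + block_size - 1) // block_size
        self.first_keys = []             # first key of each segment (kept in memory)
        self.models = []                 # (slope, intercept) per segment
        self.blocks_read = 0             # simulated I/O counter
        start = 0
        while start < len(sorted_keys):
            end = start + 1
            model = self._fit(start, end)
            # Greedily grow the segment while every covered key maps to its true block.
            while end < len(sorted_keys):
                candidate = self._fit(start, end + 1)
                if not self._block_exact(candidate, start, end + 1):
                    break
                model, end = candidate, end + 1
            self.first_keys.append(sorted_keys[start])
            self.models.append(model)
            start = end

    def _fit(self, start, end):
        return fit_line(self.keys[start:end], range(start, end))

    def _predict_block(self, model, key):
        slope, intercept = model
        block = int(slope * key + intercept) // self.block_size
        return min(max(block, 0), self.num_blocks - 1)  # clamp to valid blocks

    def _block_exact(self, model, start, end):
        return all(
            self._predict_block(model, self.keys[i]) == i // self.block_size
            for i in range(start, end)
        )

    def lookup(self, key):
        """Return the position of key, or None; touches exactly one block."""
        if not self.keys:
            return None
        seg = max(bisect.bisect_right(self.first_keys, key) - 1, 0)
        block = self._predict_block(self.models[seg], key)
        lo = block * self.block_size
        hi = min(lo + self.block_size, len(self.keys))
        self.blocks_read += 1            # the single simulated block read
        i = bisect.bisect_left(self.keys, key, lo, hi)
        return i if i < hi and self.keys[i] == key else None
```

In this sketch, `BlockAwareIndex(keys, block_size=16).lookup(k)` returns the position of `k` if it is present and `None` otherwise, and `blocks_read` grows by one per lookup either way. The quadratic-time construction keeps the code short; a practical index would use a streaming segmentation algorithm instead.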
