Index structures are fundamental to efficient query processing on large-scale datasets. Learned indexes model indexing as a prediction problem to overcome the inherent trade-offs of traditional indexes. However, most existing learned indexes optimize for only a narrow set of objectives, such as query latency or space usage, and neglect other practical evaluation dimensions such as update efficiency and stability. Moreover, many learned indexes rely on assumptions about data distributions or workloads and lack theoretical guarantees when facing unknown or evolving scenarios, which limits their generality in real-world systems. In this paper, we propose LMIndex, a robust framework for learned indexing that combines an efficient top-layer structure for queries and updates (theoretically $O(1)$ when the key type is fixed) with an efficient training algorithm for the optimal error threshold (approaching $O(1)$ in practice). Building upon this, we develop LMG (LMIndex with gaps), a variant that employs a novel gap allocation strategy to improve update performance and maintain stability under dynamic workloads. Extensive evaluations show that LMG achieves competitive or leading performance across bulk loading (up to 8.25$\times$ faster), point queries (up to 1.49$\times$ faster), range queries (up to 4.02$\times$ faster than B+Tree), updates (up to 1.5$\times$ faster on read-write workloads), stability (up to 82.59$\times$ lower coefficient of variation), and space usage (up to 1.38$\times$ smaller). These results demonstrate that LMG effectively breaks the multi-dimensional performance trade-offs inherent in state-of-the-art approaches, offering a balanced and versatile framework.
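To make the phrase "indexing as a prediction problem" concrete, the following is a minimal sketch of a generic learned index, not LMIndex or LMG themselves. It assumes a non-empty set of distinct numeric keys, fits a single linear model through the smallest and largest key, and records the worst-case prediction error so that each lookup only searches a bounded window. The class name `LinearLearnedIndex` and its methods are hypothetical and chosen purely for illustration.

```python
import bisect


class LinearLearnedIndex:
    """Toy learned index: one linear model plus an error-bounded local search.

    Illustrative assumption only; this is not the LMIndex structure.
    """

    def __init__(self, keys):
        # Assumes a non-empty collection of distinct numeric keys.
        self.keys = sorted(keys)
        n = len(self.keys)
        lo, hi = self.keys[0], self.keys[-1]
        # Model: position ~ slope * key + intercept, fitted through the endpoints.
        self.slope = (n - 1) / (hi - lo) if hi != lo else 0.0
        self.intercept = -self.slope * lo
        # Worst-case prediction error over the stored keys bounds every lookup.
        self.max_err = max(
            abs(self._predict(k) - i) for i, k in enumerate(self.keys)
        )

    def _predict(self, key):
        # Predicted position, clamped to the valid index range.
        pos = int(round(self.slope * key + self.intercept))
        return min(max(pos, 0), len(self.keys) - 1)

    def lookup(self, key):
        """Return the position of key in the sorted array, or None if absent."""
        guess = self._predict(key)
        left = max(guess - self.max_err, 0)
        right = min(guess + self.max_err + 1, len(self.keys))
        # Binary search restricted to the error window around the prediction.
        i = bisect.bisect_left(self.keys, key, left, right)
        if i < len(self.keys) and self.keys[i] == key:
            return i
        return None


if __name__ == "__main__":
    index = LinearLearnedIndex([3, 8, 15, 16, 42, 100, 250])
    print(index.lookup(42), index.lookup(17), index.max_err)
```

In this sketch the search cost depends on `max_err` rather than on the dataset size, which is why the choice of error threshold matters for a learned index. A static sorted array like this one would also need to shift elements on every insert, which is the kind of update cost that gapped layouts are intended to reduce.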