A fundamental problem in data management is to find the elements in an array that match a query. Recently, learned indexes have been used extensively to solve this problem: they learn a model that predicts the location of items in the array. They have been shown empirically to outperform non-learned methods (e.g., B-trees or binary search, which answer queries in $O(\log n)$ time) by orders of magnitude. However, the success of learned indexes has not been theoretically justified. The only existing attempt shows the same $O(\log n)$ query time, with a constant-factor improvement in space complexity over non-learned methods, under some assumptions on the data distribution. In this paper, we significantly strengthen this result, showing that under mild assumptions on the data distribution, and with the same space complexity as non-learned methods, learned indexes can answer queries in $O(\log\log n)$ expected query time. We also show that, allowing for a slightly larger but still near-linear space overhead, a learned index can achieve $O(1)$ expected query time. Our results show theoretically that learned indexes are orders of magnitude faster than non-learned methods, providing a theoretical grounding for their empirical success.
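To make the idea of "learning a model to predict the location of the items" concrete, the following is a minimal sketch of a toy learned index. It is not the construction analyzed in the paper and makes no claim to achieve the stated bounds: it assumes a non-empty sorted list of numeric keys, fits a single least-squares line from key to position, and corrects the prediction with an exponential search followed by a binary search. The names `LinearLearnedIndex`, `predict`, and `lookup` are hypothetical and chosen only for illustration.

```python
import bisect


class LinearLearnedIndex:
    """Toy learned index over a sorted list of numeric keys (illustrative only)."""

    def __init__(self, keys):
        self.keys = list(keys)
        n = len(self.keys)
        if n == 0:
            raise ValueError("keys must be non-empty")
        # Fit position ~ slope * key + intercept by ordinary least squares.
        mean_key = sum(self.keys) / n
        mean_pos = (n - 1) / 2
        var = sum((k - mean_key) ** 2 for k in self.keys)
        cov = sum((k - mean_key) * (i - mean_pos) for i, k in enumerate(self.keys))
        self.slope = cov / var if var > 0 else 0.0
        self.intercept = mean_pos - self.slope * mean_key

    def predict(self, key):
        """Model's guess for the position of `key`, clamped to a valid index."""
        guess = int(round(self.slope * key + self.intercept))
        return min(max(guess, 0), len(self.keys) - 1)

    def lookup(self, key):
        """Return the index of the first element >= key (same as bisect_left)."""
        keys, n = self.keys, len(self.keys)
        pos = self.predict(key)
        step = 1
        if keys[pos] < key:
            # The answer lies to the right of pos: grow the window rightwards.
            lo, hi = pos + 1, pos + step
            while hi < n and keys[hi] < key:
                lo = hi + 1
                step *= 2
                hi = pos + step
            hi = min(hi, n)
        else:
            # The answer lies at or to the left of pos: grow the window leftwards.
            hi, lo = pos, pos - step
            while lo >= 0 and keys[lo] >= key:
                hi = lo
                step *= 2
                lo = pos - step
            lo = max(lo + 1, 0)
        # Finish with a binary search restricted to the bracketed window.
        return bisect.bisect_left(keys, key, lo, hi)
```

In this sketch the cost of a lookup depends on how far the model's prediction is from the true position: the exponential search takes a number of steps logarithmic in that error, so an accurate model leads to a short correction phase, while a poor model degrades towards ordinary binary search. The result is always correct regardless of model quality, because the final binary search runs over a window that is guaranteed to contain the answer.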