Indexing data is a fundamental problem in computer science. Recently, various papers have applied machine learning to this problem. For a fixed integer $\varepsilon$, a \emph{learned index} is a function $h : \mathcal{U} \rightarrow [0, n]$ such that $\forall q \in \mathcal{U}$, $h(q) \in [\text{rank}(q) - \varepsilon, \text{rank}(q) + \varepsilon]$. These works use machine learning to compute $h$. They then store $S$ in a sorted array $A$ and access $A[\lfloor h(q) \rfloor]$ to answer queries in $O(k + \varepsilon + \log |h|)$ time, where $k$ denotes the output size and $|h|$ the complexity of $h$. Ferragina and Vinciguerra (VLDB 2020) observe that creating a learned index is a geometric problem. They define the PGM index by restricting $h$ to a piecewise linear function, and they give a linear-time algorithm to compute a PGM index of approximately minimum complexity. Since indexing queries are decomposable, the PGM index can be made dynamic through the logarithmic method. When deletions are allowed, range query times deteriorate to worst-case $O(N + \sum_{i=1}^{\lceil \log n \rceil} (\varepsilon + \log |h_i|))$ time, where $N$ is the largest size of $S$ seen so far. This paper offers a combination of theoretical insights and experiments: we apply techniques from computational geometry to dynamically maintain an approximately minimum-complexity learned index $h : \mathcal{U} \rightarrow [0, n]$ with $O(\log^2 n)$ update time. We also prove that if we restrict $h$ to a specific subclass of piecewise-linear functions, then we can combine $h$ with hash maps to support queries in $O(k + \varepsilon + \log |h|)$ time (at the cost of increasing $|h|$). We implement our algorithm and compare it to the existing implementation. Our empirical analysis shows that our solution supports more efficient range queries whenever the update sequence contains many deletions.
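As a minimal illustration of the query mechanism described above (not the paper's implementation), the following sketch fits a single least-squares line as $h$ and answers membership queries by probing only the $\varepsilon$-window around the predicted position. A real PGM index would instead fit an error-bounded piecewise-linear $h$; all function names here are hypothetical.

```python
import bisect
import math

def build_linear_model(keys):
    # Fit one least-squares line h(q) ~ rank(q) over a sorted key list.
    # Illustrative only: a PGM index fits an optimal piecewise-linear h.
    n = len(keys)
    mean_x = sum(keys) / n
    mean_y = (n - 1) / 2  # mean of ranks 0..n-1
    cov = sum((x - mean_x) * (y - mean_y) for y, x in enumerate(keys))
    var = sum((x - mean_x) ** 2 for x in keys)
    slope = cov / var if var else 0.0
    intercept = mean_y - slope * mean_x
    return lambda q: slope * q + intercept

def epsilon_of(keys, h):
    # Smallest integer eps with |h(k) - rank(k)| <= eps for every stored key.
    return math.ceil(max(abs(h(k) - i) for i, k in enumerate(keys)))

def lookup(A, h, eps, q):
    # Access position floor(h(q)) and binary-search only the surrounding
    # eps-window: since rank(q) lies in [h(q)-eps, h(q)+eps] and we floor
    # the prediction, the window [pos-eps, pos+eps+1] suffices.
    pos = int(h(q))
    lo = max(0, pos - eps)
    hi = min(len(A), pos + eps + 2)
    i = bisect.bisect_left(A, q, lo, hi)
    return i < len(A) and A[i] == q
```

Usage mirrors the abstract: store $S$ in a sorted array `A`, compute `h = build_linear_model(A)` and `eps = epsilon_of(A, h)`, then each `lookup` touches $O(\varepsilon)$ array cells plus the cost of evaluating $h$.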