Computing approximate quantiles or ranks over a stream is a fundamental task in data monitoring. Given a stream of elements $x_1, x_2, \dots, x_n$ and a query $x$, a relative-error quantile estimation algorithm estimates the rank of $x$ with respect to the stream up to an error of $\pm \epsilon \cdot \mathrm{rank}(x)$, that is, an error proportional to the rank itself. Notably, compared with the additive $\pm \epsilon n$ error regime, this requires the sketch to produce more precise rank estimates for elements in the tails of the distribution. Previously, the best known algorithms for relative error achieved space $\tilde O(\epsilon^{-1}\log^{1.5}(\epsilon n))$ (Cormode, Karnin, Liberty, Thaler, Veselý, 2021) and $\tilde O(\epsilon^{-2}\log(\epsilon n))$ (Zhang, Lin, Xu, Korn, Wang, 2006). In this work, we present a nearly optimal streaming algorithm for the relative-error quantile estimation problem that uses $\tilde O(\epsilon^{-1}\log(\epsilon n))$ space, almost matching the trivial $\Omega(\epsilon^{-1} \log (\epsilon n))$ lower bound. To surpass the $\Omega(\epsilon^{-1}\log^{1.5}(\epsilon n))$ barrier of the previous approach, our algorithm crucially relies on a new data structure, called an elastic compactor, which can be dynamically resized over the course of the stream. We design a space allocation scheme that adaptively allocates space to each compactor based on the "hardness" of the input stream. This approach avoids using the maximal space for every compactor simultaneously, which is what yields the improvement in total space complexity. Along the way, we also propose and study a new problem called the Top Quantiles Problem, which only requires the sketch to provide estimates for a fixed-length tail of the distribution. This problem serves as an important subproblem in our algorithm, and it is also an interesting problem in its own right.
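To make the difference between the two error regimes concrete, the following is a minimal illustrative sketch, not the algorithm of this work. It assumes that $\mathrm{rank}(x)$ counts the stream elements that are at most $x$ (the abstract does not fix this convention), and the helper names `exact_rank`, `within_relative_error`, and `within_additive_error` are hypothetical. It computes exact ranks offline and checks whether a given estimate would satisfy each guarantee.

```python
from bisect import bisect_right


def exact_rank(sorted_stream, x):
    """Exact rank of x: number of stream elements <= x (assumed convention)."""
    return bisect_right(sorted_stream, x)


def within_relative_error(estimate, true_rank, eps):
    """Relative-error guarantee: |estimate - rank(x)| <= eps * rank(x)."""
    return abs(estimate - true_rank) <= eps * true_rank


def within_additive_error(estimate, true_rank, eps, n):
    """Additive-error guarantee: |estimate - rank(x)| <= eps * n."""
    return abs(estimate - true_rank) <= eps * n


# A toy stream of n = 10,000 distinct values, stored sorted for exact ranks.
stream = sorted(range(1, 10001))
n = len(stream)
eps = 0.01

# Query an element in the low-rank tail of the distribution.
x = 50
r = exact_rank(stream, x)

# A hypothetical estimate that is off by 20 positions.
estimate = r + 20

# The additive budget is eps * n = 100, so the estimate is acceptable there,
# but the relative budget is only eps * rank(x) = 0.5, so it is rejected.
additive_ok = within_additive_error(estimate, r, eps, n)
relative_ok = within_relative_error(estimate, r, eps)
print(r, additive_ok, relative_ok)
```

For a tail element the relative budget $\epsilon \cdot \mathrm{rank}(x)$ is far smaller than the additive budget $\epsilon n$, which is why a relative-error sketch has to keep tail elements much more precisely.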