Estimating quantiles is one of the foundational problems of data sketching. Given $n$ elements $x_1, x_2, \dots, x_n$ from a universe of size $U$ arriving in a data stream, a quantile sketch estimates the rank of any element with additive error at most $\varepsilon n$. A low-space algorithm for this task has applications in database systems, network measurement, load balancing, and many other practical scenarios. Existing quantile estimation algorithms described as optimal include the deterministic GK sketch (Greenwald and Khanna 2001), which uses $O(\varepsilon^{-1} \log n)$ words, and the randomized KLL sketch (Karnin, Lang, and Liberty 2016), which uses $O(\varepsilon^{-1} \log\log(1/\delta))$ words with failure probability $\delta$. However, both algorithms are optimal only in the comparison-based model, whereas most typical applications involve streams of integers, whose values a sketch can exploit beyond merely comparing them. Outside the comparison-based model, the deterministic q-digest sketch (Shrivastava, Buragohain, Agrawal, and Suri 2004) achieves a space complexity of $O(\varepsilon^{-1}\log U)$ words, which is incomparable to the bounds of the two sketches above. It has long been asked whether there is a quantile sketch using $O(\varepsilon^{-1})$ words of space (which is optimal as long as $n \leq \mathrm{poly}(U)$). In this work, we present a deterministic algorithm using $O(\varepsilon^{-1})$ words, resolving this question.
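To make the guarantee concrete, the following Python sketch illustrates what "rank with additive error at most $\varepsilon n$" means. It is not the algorithm of this work, nor GK, KLL, or q-digest: it is an offline baseline that sorts the whole input and keeps every $\lfloor \varepsilon n \rfloor$-th element, so it ignores the streaming constraint entirely. The function names (`exact_rank`, `build_offline_summary`, `estimate_rank`) and the convention that the rank of $x$ is the number of elements $\leq x$ are assumptions made for illustration only.

```python
import bisect
import math
import random


def exact_rank(sorted_data, x):
    """Rank of x, taken here as the number of elements <= x (an assumed convention)."""
    return bisect.bisect_right(sorted_data, x)


def build_offline_summary(data, eps):
    """Offline baseline, NOT a streaming sketch: keep every step-th sorted element.

    With step = max(1, floor(eps * n)), roughly 1/eps elements are retained.
    """
    sorted_data = sorted(data)
    n = len(sorted_data)
    step = max(1, math.floor(eps * n))
    # Elements at 1-indexed sorted positions step, 2*step, 3*step, ...
    samples = sorted_data[step - 1::step]
    return samples, step


def estimate_rank(samples, step, x):
    """Each retained element <= x stands for `step` input elements.

    The estimate equals step * floor(rank / step), so it undercounts the
    true rank by at most step - 1, which is below eps * n.
    """
    return step * bisect.bisect_right(samples, x)


def demo():
    """Check the additive-error guarantee on a random integer stream."""
    rng = random.Random(0)
    universe_size, n, eps = 1000, 10_000, 0.01
    data = [rng.randrange(universe_size) for _ in range(n)]
    samples, step = build_offline_summary(data, eps)
    sorted_data = sorted(data)
    worst = max(
        abs(exact_rank(sorted_data, x) - estimate_rank(samples, step, x))
        for x in range(universe_size)
    )
    return len(samples), worst


if __name__ == "__main__":
    kept, worst = demo()
    print(f"retained {kept} elements, worst rank error {worst}")
```

The baseline shows that about $\varepsilon^{-1}$ stored elements suffice for the error guarantee when the full input can be sorted first. The difficulty addressed by the sketches discussed above is meeting the same guarantee in a single pass over the stream, without storing it.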