Quantiles, such as the median or percentiles, provide concise and useful information about the distribution of a collection of items drawn from a totally ordered universe. We study data structures, called quantile summaries, which keep track of all quantiles, up to an error of at most $\varepsilon$. That is, an $\varepsilon$-approximate quantile summary first processes a stream of items and then, given any quantile query $0\le \phi\le 1$, returns an item from the stream, which is a $\phi'$-quantile for some $\phi' = \phi\pm \varepsilon$. We focus on comparison-based quantile summaries that can only compare two items and are otherwise completely oblivious of the universe. The best such deterministic quantile summary to date, due to Greenwald and Khanna (SIGMOD '01), stores at most $O(\frac{1}{\varepsilon}\cdot \log \varepsilon N)$ items, where $N$ is the number of items in the stream. We prove that this space bound is optimal by showing a matching lower bound. Our result thus rules out the possibility of constructing a deterministic comparison-based quantile summary in space $f(\varepsilon)\cdot o(\log N)$, for any function $f$ that does not depend on $N$. As a corollary, we improve the lower bound for biased quantiles, which provide a stronger, relative-error guarantee of $(1\pm \varepsilon)\cdot \phi$, and for other related computational tasks.
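To make the $\varepsilon$-approximate guarantee concrete, the following minimal Python sketch checks whether a returned item is an acceptable answer to a quantile query and builds a simple offline baseline that satisfies the guarantee. It is an illustration under stated assumptions, not the Greenwald and Khanna summary and not a streaming algorithm: it sorts the whole input, so it does not achieve the space bounds discussed above. The function names (`is_eps_approx_quantile`, `build_offline_summary`, `query`) and the rank convention (the target rank is $\lceil \phi N\rceil$, clamped to $[1, N]$) are assumptions made here for illustration. All operations on items are comparisons only.

```python
import bisect
import math


def target_rank(n, phi):
    """Assumed convention: the phi-quantile has rank ceil(phi * n), clamped to [1, n]."""
    return min(n, max(1, math.ceil(phi * n)))


def is_eps_approx_quantile(items, x, phi, eps):
    """Return True if x occurs in items and one of its ranks is within eps * n of the target."""
    s = sorted(items)
    n = len(s)
    lo = bisect.bisect_left(s, x) + 1   # smallest 1-based rank x can occupy
    hi = bisect.bisect_right(s, x)      # largest 1-based rank x can occupy
    if hi < lo:                         # x does not occur in the input
        return False
    t = target_rank(n, phi)
    return lo - eps * n <= t <= hi + eps * n


def build_offline_summary(items, eps):
    """Hypothetical offline baseline: keep (rank, item) pairs spaced at most 2 * eps * n apart."""
    s = sorted(items)
    n = len(s)
    step = max(1, int(2 * eps * n))
    ranks = list(range(1, n + 1, step))
    if ranks[-1] != n:                  # always keep the maximum
        ranks.append(n)
    return [(r, s[r - 1]) for r in ranks]


def query(summary, n, phi):
    """Return the stored item whose rank is closest to the target rank."""
    t = target_rank(n, phi)
    return min(summary, key=lambda pair: abs(pair[0] - t))[1]


if __name__ == "__main__":
    data = [37, 5, 91, 12, 64, 28, 73, 50, 19, 86, 2, 45, 58, 33, 99, 7]
    eps = 0.125
    summary = build_offline_summary(data, eps)
    for phi in (0.0, 0.25, 0.5, 0.75, 1.0):
        answer = query(summary, len(data), phi)
        print(phi, answer, is_eps_approx_quantile(data, answer, phi, eps))
```

Because consecutive stored ranks differ by at most $2\varepsilon N$, every target rank lies within $\varepsilon N$ of some stored rank, which is why this baseline keeps roughly $\frac{1}{2\varepsilon}$ items. The difficulty addressed by the lower bound is achieving the same guarantee in a single pass, without first sorting the stream.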