Exponential histograms, with bins of the form $\left\{ \left(\rho^{k-1},\rho^{k}\right]\right\} _{k\in\mathbb{Z}}$ for $\rho>1$, give a straightforward summary of the quantiles of streaming data sets (Masson et al. 2019). Although they guarantee the relative accuracy of their estimates, they appear to use only $\log n$ values to summarize $n$ inputs. We study four aspects of exponential histograms (size, accuracy, occupancy, and largest gap size) when inputs are i.i.d. $\mathrm{Exp}\left(\lambda\right)$ or i.i.d. $\mathrm{Pareto}\left(\nu,\beta\right)$, taking $\mathrm{Exp}\left(\lambda\right)$ to represent all light-tailed distributions and $\mathrm{Pareto}\left(\nu,\beta\right)$ to represent all heavy-tailed ones. We show that, in these settings, size grows like $\log n$ and follows a Gumbel distribution as $n$ grows large. We bound the missing mass to the right of the histogram and the mass of its final bin, and we show that occupancy grows apace with size. Finally, we approximate the length of the longest run of consecutive empty bins. Our study gives a deeper and broader view of this low-memory approach to quantile estimation.
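To make the bin structure concrete, the following is a minimal sketch of an exponential histogram, not the construction analyzed in the paper itself. It assumes strictly positive inputs. The function names (`bin_index`, `build_histogram`, `estimate_quantile`, `size_and_occupancy`) are hypothetical. The representative value $2\rho^{k}/(\rho+1)$ for bin $k$ is one common choice, used in the DDSketch of Masson et al. (2019), that keeps the relative error within $(\rho-1)/(\rho+1)$. Reading "size" as the number of bins spanned from the smallest to the largest occupied bin, and "occupancy" as the number of non-empty bins, is likewise an assumption made for illustration.

```python
import math
import random
from collections import Counter


def bin_index(x, rho):
    """Return k such that rho**(k-1) < x <= rho**k, for x > 0 and rho > 1.

    Floating-point rounding may misplace values lying exactly on a bin edge.
    """
    return math.ceil(math.log(x) / math.log(rho))


def build_histogram(xs, rho):
    """Count how many inputs fall into each exponential bin."""
    return Counter(bin_index(x, rho) for x in xs)


def estimate_quantile(hist, q, rho):
    """Estimate the q-quantile from bin counts alone.

    Walks the bins in increasing order until the cumulative count exceeds
    the rank q * (n - 1), then returns the representative 2 * rho**k / (rho + 1).
    """
    n = sum(hist.values())
    rank = q * (n - 1)
    seen = 0
    for k in sorted(hist):
        seen += hist[k]
        if seen > rank:
            return 2 * rho**k / (rho + 1)
    raise ValueError("empty histogram or q outside [0, 1]")


def size_and_occupancy(hist):
    """Assumed definitions: bins spanned, and bins that are non-empty."""
    size = max(hist) - min(hist) + 1
    occupancy = len(hist)
    return size, occupancy


if __name__ == "__main__":
    random.seed(0)
    rho = 1.1
    xs = [random.expovariate(1.0) for _ in range(10_000)]  # i.i.d. Exp(1)
    hist = build_histogram(xs, rho)
    print("median estimate:", estimate_quantile(hist, 0.5, rho))
    print("size, occupancy:", size_and_occupancy(hist))
```

Only the bin counts are stored, so memory depends on the number of occupied bins rather than on $n$, which is the quantity whose growth the abstract describes.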