We revisit the problem of estimating the profile (also known as the rarity) in the data stream model. Given a sequence of $m$ elements from a universe of size $n$, its profile is a vector $\phi$ whose $i$-th entry $\phi_i$ is the number of distinct elements that appear in the stream exactly $i$ times. A classic 2002 paper by Datar and Muthukrishnan gave an algorithm that estimates any entry $\phi_i$ up to an additive error of $\pm \epsilon D$ using $O(1/\epsilon^2 (\log n + \log m))$ bits of space, where $D$ is the number of distinct elements in the stream. In this paper, we considerably improve on this result by designing an algorithm that simultaneously estimates many coordinates of the profile vector $\phi$ with small overall error. We give an algorithm which, with constant probability, produces an estimated profile $\hat\phi$ with the following guarantees in terms of space and estimation error:
- For any constant $\tau$, with $O(1 / \epsilon^2 + \log n)$ bits of space, $\sum_{i=1}^\tau |\phi_i - \hat\phi_i| \leq \epsilon D$.
- With $O(1/ \epsilon^2\log (1/\epsilon) + \log n + \log \log m)$ bits of space, $\sum_{i=1}^m |\phi_i - \hat\phi_i| \leq \epsilon m$.
In addition to bounding the error across multiple coordinates, our space bounds separate the terms that depend on $1/\epsilon$ from those that depend on $n$ and $m$. We prove matching space lower bounds in both regimes. Applying our profile estimation algorithm yields estimates within error $\pm \epsilon D$ of several symmetric functions of frequencies in $O(1/\epsilon^2 + \log n)$ bits. This generalizes space-optimal algorithms for the distinct elements problem to other problems, including estimating the Huber and Tukey losses as well as frequency cap statistics.
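To make the definitions concrete, the following minimal Python sketch computes the exact profile of a small stream, the summed error $\sum_i |\phi_i - \hat\phi_i|$ between a profile and a candidate estimate, and one symmetric function of the frequencies (a frequency cap statistic) from the profile alone. It is an offline, linear-memory illustration of the quantities involved, not the small-space streaming algorithm of the paper; the function names `profile`, `l1_error`, and `frequency_cap`, and the example stream, are hypothetical and chosen only for illustration.

```python
from collections import Counter


def profile(stream):
    """Return {i: phi_i}, the number of distinct elements appearing exactly i times.

    Exact and offline: stores every distinct element, unlike a streaming sketch.
    """
    freq = Counter(stream)          # element -> frequency
    return Counter(freq.values())   # frequency i -> number of elements with frequency i


def l1_error(phi, phi_hat, tau=None):
    """Sum of |phi_i - phi_hat_i| over i = 1..tau (over all i when tau is None)."""
    keys = set(phi) | set(phi_hat)
    if tau is not None:
        keys = {i for i in keys if i <= tau}
    return sum(abs(phi.get(i, 0) - phi_hat.get(i, 0)) for i in keys)


def frequency_cap(phi, cap):
    """Sum over distinct elements of min(frequency, cap), computed from the profile."""
    return sum(min(i, cap) * count for i, count in phi.items())


# Hypothetical example stream: a appears 3 times, b twice, c and d once each.
stream = ["a", "b", "a", "c", "b", "a", "d"]
phi = profile(stream)
D = sum(phi.values())                    # number of distinct elements
m = sum(i * c for i, c in phi.items())   # stream length recovered from the profile
```

Here `phi` maps 1 to 2, 2 to 1, and 3 to 1, so `D` is 4 and `m` is 7. Because the frequency cap statistic depends only on the multiset of frequencies, it can be read off the profile, which is why an accurate profile estimate carries over to such symmetric functions.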