Optimally detecting uniformly-distributed $\ell_2$ heavy hitters in data streams

From arXiv: In this version, we remove our previous result for random-order streams after becoming aware that it had already appeared in prior work by Braverman, Garg, and Woodruff [2020]. We now properly cite their result and present only our new contribution concerning partially random-order streams.

Given a stream $x_1,x_2,\dots,x_n$ of items from a universe $U$ of size poly$(n)$, and a parameter $ε>0$, an item $i\in U$ is said to be an $\ell_2$ heavy hitter if its frequency $f_i$ in the stream is at least $\sqrt{εF_2}$, where $F_2=\sum_{i\in U} f_i^2$. Efficiently detecting such heavy hitters is a fundamental problem in data streams, with applications in both theory and practice. The classical $\mathsf{CountSketch}$ algorithm due to Charikar, Chen, and Farach-Colton [2004] was the first algorithm to detect $\ell_2$ heavy hitters using $O\left(\frac{\log^2 n}{ε}\right)$ bits of space, and their algorithm is optimal for streams with deletions. Braverman, Chestnut, Ivkin, Nelson, Wang, and Woodruff [2017] gave the $\mathsf{BPTree}$ algorithm, which detects $\ell_2$ heavy hitters in insertion-only streams using only $O\left(\frac{\log(1/ε)}{ε}\log n \right)$ space. Note that any algorithm requires at least $Ω\left(\frac{1}{ε} \log n\right)$ space to output $O(1/ε)$ heavy hitters in the worst case. While $\mathsf{BPTree}$ achieves the optimal space bound for constant $ε$, its bound could be suboptimal for $ε=o(1)$. For $\textit{random order}$ streams, where the stream elements can be adversarial but their order of arrival is uniformly random, Braverman, Garg, and Woodruff [2020] showed that it is possible to achieve the optimal space bound of $O\left(\frac{1}{ε} \log n\right)$ for every $ε= Ω\left(\frac{1}{2^{\sqrt{\log n}}}\right)$. In this work, we generalize their result to $\textit{partially random order}$ streams, where only the heavy hitters are required to be uniformly distributed in the stream. We show that it is possible to achieve the same space bound, but with the additional assumption that the algorithm is given a constant approximation to $F_2$ in advance.
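To make the definition concrete, the following minimal Python sketch computes $F_2$ and the set of $\ell_2$ heavy hitters exactly for a small stream. It is only a reference computation of the definition: it stores every frequency and so uses linear space, unlike the small-space algorithms discussed in the abstract. The function name `l2_heavy_hitters` and the toy stream are illustrative assumptions, not part of the paper.

```python
from collections import Counter


def l2_heavy_hitters(stream, eps):
    """Return (F2, {item: frequency}) for items with f_i >= sqrt(eps * F2).

    Exact reference computation of the definition. It keeps all
    frequencies in memory, so it is NOT a small-space streaming algorithm.
    """
    freq = Counter(stream)                      # f_i for every item i
    f2 = sum(f * f for f in freq.values())      # F_2 = sum_i f_i^2
    # f_i >= sqrt(eps * F2) is equivalent to f_i^2 >= eps * F2 for f_i >= 0;
    # comparing squares avoids taking a square root.
    heavy = {item: f for item, f in freq.items() if f * f >= eps * f2}
    return f2, heavy


# Hypothetical toy stream: 'a' appears 6 times, 'b' 3 times, 9 items once.
stream = ["a"] * 6 + ["b"] * 3 + list("cdefghijk")

f2, heavy = l2_heavy_hitters(stream, eps=0.5)
print(f2, heavy)                                # F2 = 36 + 9 + 9 = 54
print(l2_heavy_hitters(stream, eps=0.1)[1])     # lower threshold admits 'b'
```

Since every heavy hitter contributes at least $εF_2$ to $F_2$, there can be at most $1/ε$ of them, which is why the output size in the lower bound above is $O(1/ε)$.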
