$L_p$ Sampling in Distributed Data Streams with Applications to Adversarial Robustness

In the distributed monitoring model, a data stream over a universe of size $n$ is distributed over $k$ servers, which must continuously provide certain statistics of the overall dataset while minimizing communication with a central coordinator. In such settings, the ability to efficiently collect a random sample from the global stream is a powerful primitive, enabling a wide array of downstream tasks such as estimating frequency moments, detecting heavy hitters, or performing sparse recovery. Of particular interest is the task of producing a perfect $L_p$ sample: given a frequency vector $f \in \mathbb{R}^n$, the sampler outputs an index $i$ with probability $\frac{f_i^p}{\|f\|_p^p}+\frac{1}{\mathrm{poly}(n)}$. In this paper, we resolve the problem of perfect $L_p$ sampling for all $p\ge 1$ in the distributed monitoring model. Specifically, our algorithm uses $k^{p-1} \cdot \mathrm{polylog}(n)$ bits of communication, which is optimal up to polylogarithmic factors. Utilizing our perfect $L_p$ sampler, we achieve adversarially robust distributed monitoring protocols for the $F_p$ moment estimation problem, where the goal is to provide a $(1+\varepsilon)$-approximation to $f_1^p+\ldots+f_n^p$. Our algorithm uses $\frac{k^{p-1}}{\varepsilon^2}\cdot\mathrm{polylog}(n)$ bits of communication for all $p\ge 2$, which is optimal up to polylogarithmic factors and matches the lower bounds of Woodruff and Zhang (STOC 2012) for the non-robust setting. Finally, we apply our framework to achieve near-optimal adversarially robust distributed protocols for central problems such as counting, frequency estimation, heavy hitters, and distinct element estimation.
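To make the target distribution concrete, the following is a minimal centralized sketch in Python of what an exact $L_p$ sample and the $F_p$ moment look like once the per-server streams are merged. It is not the communication-efficient protocol of the paper: it assumes an insertion-only stream, ships every update to one place, and ignores the $\frac{1}{\mathrm{poly}(n)}$ additive error. The function names (`global_frequencies`, `fp_moment`, `lp_distribution`, `lp_sample`) and the toy streams are hypothetical and serve only as illustration.

```python
import random
from collections import Counter


def global_frequencies(server_streams):
    """Merge the item streams of all servers into one frequency vector f."""
    total = Counter()
    for stream in server_streams:
        total.update(stream)  # each occurrence of an item adds 1 to its count
    return total


def fp_moment(freq, p):
    """F_p = sum_i f_i^p, assuming non-negative counts (insertion-only)."""
    return sum(count ** p for count in freq.values())


def lp_distribution(freq, p):
    """Exact L_p sampling probabilities f_i^p / ||f||_p^p for each index i."""
    norm = fp_moment(freq, p)
    return {i: (count ** p) / norm for i, count in freq.items()}


def lp_sample(freq, p, rng):
    """Draw one index i with probability proportional to f_i^p."""
    items = list(freq)
    weights = [freq[i] ** p for i in items]
    return rng.choices(items, weights=weights, k=1)[0]


# Toy input (an assumption for illustration): k = 3 servers, universe {0, 1, 2}.
streams = [[0, 0, 1], [0, 2], [1, 0]]
freq = global_frequencies(streams)   # f = (4, 2, 1)
dist = lp_distribution(freq, p=2)    # proportional to (16, 4, 1)
sample = lp_sample(freq, p=2, rng=random.Random(0))
```

In this toy instance $f = (4, 2, 1)$, so $F_2 = 16 + 4 + 1 = 21$ and index $0$ should be returned with probability $16/21$. The naive approach above communicates every update to the coordinator, which is exactly the cost that the $k^{p-1} \cdot \mathrm{polylog}(n)$ bound avoids.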
