We propose a novel framework for statistical estimation on noisy datasets. Within this framework, we focus on the frequency moments ($F_p$) problem and show that $F_p$ of the unknown ground-truth dataset can be approximated using sublinear space in the data stream model and sublinear communication in the coordinator model, provided that the approximation ratio is parameterized by a data-dependent quantity, which we call the $F_p$-mismatch-ambiguity. We also establish a set of lower bounds that are tight in terms of the input size. Our results yield several insights: (1) In the data stream model, the $F_p$ problem is inherently harder in the noisy setting than in the noiseless one. In particular, while $F_2$ can be approximated in space logarithmic in the input size in the noiseless setting, any algorithm for $F_2$ in the noisy setting requires polynomial space. (2) In the coordinator model, in sharp contrast to the noiseless case, communication polylogarithmic in the input size is generally impossible for $F_p$ under noise. However, when the $F_p$-mismatch-ambiguity falls below a certain threshold, it becomes possible to achieve communication that is entirely independent of the input size.
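For readers unfamiliar with the problem, the following minimal sketch illustrates the standard noiseless definitions only. It assumes the usual definition $F_p = \sum_i f_i^p$, where $f_i$ is the number of occurrences of item $i$, and shows a tug-of-war style estimator for $F_2$ in the spirit of the classical Alon-Matias-Szegedy sketch. It is not the algorithm of this work and does not model noise or the $F_p$-mismatch-ambiguity; the names `frequency_moment` and `TugOfWarF2` are hypothetical, and items are assumed to be non-negative integers.

```python
import random
from collections import Counter


def frequency_moment(stream, p):
    """Exact F_p: sum over distinct items of (frequency ** p). Uses linear space."""
    counts = Counter(stream)
    return sum(c ** p for c in counts.values())


class TugOfWarF2:
    """Illustrative tug-of-war estimator for F_2 in the noiseless setting.

    Each counter keeps a running sum of pseudo-random +1/-1 signs, one sign
    per distinct item. The square of a counter has expectation F_2, and
    averaging many independent counters reduces the variance.
    """

    PRIME = (1 << 61) - 1  # Mersenne prime used as the hash field size

    def __init__(self, num_counters, seed=0):
        rng = random.Random(seed)
        # One random degree-3 polynomial per counter; evaluating it modulo a
        # prime gives 4-wise independent hash values for distinct items.
        self.coeffs = [
            [rng.randrange(self.PRIME) for _ in range(4)]
            for _ in range(num_counters)
        ]
        self.counters = [0] * num_counters

    def _sign(self, coeffs, item):
        # Horner evaluation of the polynomial, then map the low bit to +1/-1.
        h = 0
        for a in coeffs:
            h = (h * item + a) % self.PRIME
        return 1 if h & 1 else -1

    def update(self, item):
        """Process one stream element (a non-negative integer)."""
        for j, coeffs in enumerate(self.coeffs):
            self.counters[j] += self._sign(coeffs, item)

    def estimate(self):
        """Average of squared counters, an estimate of F_2."""
        return sum(z * z for z in self.counters) / len(self.counters)


if __name__ == "__main__":
    stream = [i % 50 for i in range(1000)]  # 50 items, each appearing 20 times
    sketch = TugOfWarF2(num_counters=400, seed=1)
    for x in stream:
        sketch.update(x)
    print("exact F_2:", frequency_moment(stream, 2))
    print("sketch estimate:", sketch.estimate())
```

As a small check of the definition, the stream `[1, 1, 2, 3, 3, 3]` has frequencies 2, 1, and 3, so `frequency_moment` gives $F_0 = 3$, $F_1 = 6$, and $F_2 = 14$. The sketch stores only its counters and hash coefficients rather than the per-item frequencies, which is the kind of small-space behaviour the noiseless $F_2$ result refers to.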