Identifying heavy hitters in data streams is a fundamental problem with widespread applications in modern analytics systems. These streams are often derived from sensitive user activity, making update-level privacy guarantees necessary. While recent work has adapted the classical Misra-Gries heavy hitter algorithm to satisfy differential privacy in the streaming model, other heavy hitter algorithms with better empirical utility have not yet been privatized. Motivated by this gap, we make three contributions. First, we introduce the first differentially private variant of the SpaceSaving algorithm, which is regarded as the state of the art in practice in the non-private setting. Our construction post-processes a non-private SpaceSaving summary by injecting asymptotically optimal noise and applying a carefully calibrated selection rule that suppresses unstable labels. This yields strong privacy guarantees while preserving the empirical advantages of SpaceSaving. Second, we introduce a generic method for extracting heavy hitters from any differentially private frequency oracle in the data stream model. The method requires only O(k) additional memory, where k is the number of heavy items, and provides a mechanism for safely releasing item identities from noisy frequency estimates. This yields an efficient, plug-and-play approach for private heavy hitter recovery from linear sketches. Finally, we conduct an experimental evaluation on synthetic and real-world datasets. Across a wide range of privacy parameters and space budgets, our method provides better utility than the existing differentially private Misra-Gries algorithm. Our results demonstrate that the empirical superiority of SpaceSaving survives privatization and that efficient, practical heavy hitter identification is achievable under strong differential privacy guarantees.
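To make the post-processing idea concrete, the following is a minimal Python sketch of its general shape: build a standard (non-private) SpaceSaving summary, add noise to each counter, and release only the labels whose noisy count clears a threshold. It is an illustration under stated assumptions, not the construction analyzed in this work: the function names (`space_saving`, `noisy_release`), the choice of Laplace noise, and the `noise_scale` and `threshold` parameters are all hypothetical placeholders, and the sketch carries no privacy guarantee as written, since the calibration of noise and threshold is exactly what the actual analysis has to supply.

```python
import random


def space_saving(stream, k):
    """Standard non-private SpaceSaving summary with at most k counters."""
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k:
            counters[item] = 1
        else:
            # Evict a minimum-count label; the new item inherits its count plus one.
            victim = min(counters, key=counters.get)
            counters[item] = counters.pop(victim) + 1
    return counters


def laplace(scale, rng):
    """Laplace(0, scale) sample as the difference of two exponential samples."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_release(counters, noise_scale, threshold, rng=None):
    """Add noise to each counter and keep only labels above the threshold.

    ASSUMPTION: noise_scale and threshold are placeholders. How they must be
    set to obtain differential privacy is not specified in this sketch.
    """
    rng = rng or random.Random()
    released = {}
    for item, count in counters.items():
        noisy = count + laplace(noise_scale, rng)
        if noisy >= threshold:
            released[item] = noisy
    return released


if __name__ == "__main__":
    stream = ["a"] * 5 + ["b"] * 3 + ["c"]
    summary = space_saving(stream, k=2)
    print(summary)
    print(noisy_release(summary, noise_scale=1.0, threshold=3.0))
```

The thresholding step is a stand-in for the selection rule that suppresses unstable labels: a label that entered the summary only because of a few recent updates has a low count, so it is unlikely to survive the cut.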