Detecting the Top-$k$ frequent items is a fundamental task in data stream mining. Many promising solutions have been proposed to improve memory efficiency while maintaining high detection accuracy. Beyond memory efficiency, users who participate in the task without proper protection risk privacy loss, since the local data streams they contribute may continually leak sensitive individual information. However, most existing works address either memory efficiency or privacy, but seldom both, and therefore cannot achieve a satisfactory tradeoff among memory efficiency, privacy protection, and detection accuracy. In this paper, we present a novel framework, HG-LDP, that achieves accurate Top-$k$ item detection at bounded memory expense while providing rigorous local differential privacy (LDP) protection. Specifically, we identify two key challenges that arise naturally in the task and show that directly applying existing LDP techniques leads to an inferior ``accuracy-privacy-memory efficiency'' tradeoff. We therefore instantiate three advanced schemes under the framework by designing novel LDP randomization methods that address the hurdles caused by the large item domain and the limited memory space. Comprehensive experiments on both synthetic and real-world datasets show that the proposed advanced schemes achieve a superior ``accuracy-privacy-memory efficiency'' tradeoff, using $2300\times$ less memory than baseline methods when the item domain size is $41,270$. Our code is open-sourced via the link.
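To illustrate why a large item domain is a hurdle for off-the-shelf LDP techniques, the following minimal sketch implements generalized (k-ary) randomized response, a standard LDP mechanism. It is not the randomization method of HG-LDP, and the function names `grr_probs`, `grr_perturb`, and `grr_estimate` are hypothetical, chosen only for this example.

```python
import math
import random
from typing import Tuple


def grr_probs(epsilon: float, d: int) -> Tuple[float, float]:
    """Return (p, q) for generalized randomized response over d items.

    p is the probability of reporting the true item; q is the probability
    of reporting any one specific other item. The ratio p / q equals
    exp(epsilon), which is what gives epsilon-LDP.
    """
    e = math.exp(epsilon)
    p = e / (e + d - 1)
    q = 1.0 / (e + d - 1)
    return p, q


def grr_perturb(item: int, epsilon: float, d: int, rng: random.Random) -> int:
    """Perturb one item in {0, ..., d-1} on the user's side."""
    p, _ = grr_probs(epsilon, d)
    if rng.random() < p:
        return item
    # Otherwise report one of the other d-1 items uniformly at random.
    other = rng.randrange(d - 1)
    return other if other < item else other + 1


def grr_estimate(observed_count: float, n: int, epsilon: float, d: int) -> float:
    """Unbiased estimate of an item's true count from n perturbed reports."""
    p, q = grr_probs(epsilon, d)
    return (observed_count - n * q) / (p - q)
```

Under this sketch, with $\epsilon = 1$ and an item domain of size $41,270$, `grr_probs` gives a probability below $10^{-4}$ of reporting the true item, so almost every report is noise. This is one way to see why directly applying an existing mechanism over the full domain can yield a poor accuracy-privacy tradeoff.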