Privacy is a central challenge for systems that learn from sensitive data sets, especially when a system's outputs must be continuously updated to reflect changing data. We consider the achievable error for differentially private continual release of a basic statistic -- the number of distinct items -- in a stream where items may be both inserted and deleted (the turnstile model). With only insertions, existing algorithms have additive error just polylogarithmic in the length of the stream $T$. We uncover a much richer landscape in the turnstile model, even without considering memory restrictions. We show that every differentially private mechanism that handles insertions and deletions has worst-case additive error at least $T^{1/4}$, even under a relatively weak, event-level privacy definition. Then, we identify a parameter of the input stream, its maximum flippancy, that is low for natural data streams and for which we give tight parameterized error guarantees. Specifically, the maximum flippancy is the largest number of times that the contribution of a single item to the distinct elements count changes over the course of the stream. We present an item-level differentially private mechanism that, for all turnstile streams with maximum flippancy $w$, continually outputs the number of distinct elements with $O(\sqrt{w} \cdot \mathrm{polylog}\, T)$ additive error, without requiring prior knowledge of $w$. We prove that this is the best achievable error bound that depends only on $w$, for a large range of values of $w$. When $w$ is small, the error of our mechanism is comparable to the error polylogarithmic in $T$ that is achievable in the insertion-only setting, bypassing the hardness in the turnstile model.
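To make the definition of maximum flippancy concrete, the following is a minimal, non-private Python sketch that computes the exact distinct-elements count at every time step together with the maximum flippancy of a turnstile stream. It is only an illustration of the two quantities, not the mechanism described above. The stream encoding (pairs `(item, +1)` or `(item, -1)`, with `None` for a step without an update), the function name `distinct_counts_and_max_flippancy`, and the rule that an item is present exactly when its net multiplicity is positive (with deletions of absent items ignored) are assumptions made for this example.

```python
from collections import defaultdict


def distinct_counts_and_max_flippancy(stream):
    """Return (exact distinct count after each step, maximum flippancy).

    Assumed encoding (hypothetical, for illustration only):
      - each step is (item, +1) for an insertion, (item, -1) for a deletion,
        or None when nothing happens at that step;
      - an item is "present" when its net multiplicity is positive;
      - deleting an item that is absent is ignored (multiplicity stays at 0).
    """
    multiplicity = defaultdict(int)  # net number of copies of each item
    flips = defaultdict(int)         # times each item's presence bit changed
    distinct = 0                     # current number of present items
    outputs = []

    for update in stream:
        if update is not None:
            item, delta = update
            was_present = multiplicity[item] > 0
            # Clamp at zero so deletions of absent items have no effect.
            multiplicity[item] = max(0, multiplicity[item] + delta)
            is_present = multiplicity[item] > 0
            if was_present != is_present:
                # The item's contribution to the count changed: one flip.
                flips[item] += 1
                distinct += 1 if is_present else -1
        outputs.append(distinct)

    # Maximum flippancy: the largest flip count over all items.
    return outputs, max(flips.values(), default=0)


if __name__ == "__main__":
    example = [("a", +1), ("b", +1), ("a", -1), None, ("a", +1), ("b", +1)]
    counts, w = distinct_counts_and_max_flippancy(example)
    print(counts)  # [1, 2, 1, 1, 2, 2]
    print(w)       # 3
```

In this example, item `a` is inserted, deleted, and inserted again, so its contribution changes three times, while the second insertion of `b` does not change its contribution. The maximum flippancy is therefore 3, even though the stream has length $T = 6$.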