Privacy is a central challenge for systems that learn from sensitive data sets, especially when a system's outputs must be continuously updated to reflect changing data. We consider the achievable error for the differentially private continual release of a basic statistic -- the number of distinct items -- in a stream where items may be both inserted and deleted (the turnstile model). With only insertions, existing algorithms have additive error that is just polylogarithmic in the length of the stream $T$. We uncover a much richer landscape in the turnstile model, even without considering memory restrictions. We show that any differentially private mechanism that handles insertions and deletions has worst-case additive error at least $T^{1/4}$, even under a relatively weak, event-level privacy definition. Then, we identify a property of the input stream, its maximum flippancy, that is low for natural data streams and for which one can give tight parameterized error guarantees. Specifically, the maximum flippancy is the largest number of times the count of a single item changes from a positive number to zero over the course of the stream. We present an item-level differentially private mechanism that, for all turnstile streams with maximum flippancy $w$, continually outputs the number of distinct elements with $O(\sqrt{w} \cdot \mathsf{poly}\log T)$ additive error, without requiring prior knowledge of $w$. This is the best achievable error bound that depends only on $w$, for a large range of values of $w$. When $w$ is small, our mechanism provides error guarantees comparable to the polylogarithmic-in-$T$ guarantees of the insertion-only setting, bypassing the hardness in the turnstile model.
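To make the definition of maximum flippancy concrete, the following is a minimal, non-private sketch in Python. It assumes a turnstile stream encoded as a list of `(item, delta)` pairs with `delta` equal to `+1` (insertion) or `-1` (deletion); this encoding and the function name `distinct_counts_and_max_flippancy` are illustrative assumptions, not notation from the paper. The sketch tracks the exact number of distinct items after each update and counts, per item, how often its count drops from a positive number to zero. It does not implement the differentially private mechanism.

```python
from collections import defaultdict


def distinct_counts_and_max_flippancy(stream):
    """Exact (non-private) distinct counts and maximum flippancy.

    `stream` is an iterable of (item, delta) pairs with delta in {+1, -1}.
    This encoding is an assumption made for illustration only.
    """
    counts = defaultdict(int)  # current count of each item
    flips = defaultdict(int)   # times each item's count went positive -> zero
    distinct = 0
    distinct_over_time = []

    for item, delta in stream:
        if delta not in (1, -1):
            raise ValueError("delta must be +1 or -1")
        before = counts[item]
        after = before + delta
        if after < 0:
            raise ValueError(f"count of {item!r} would become negative")
        counts[item] = after

        if before == 0 and after > 0:
            distinct += 1          # item becomes present
        elif before > 0 and after == 0:
            distinct -= 1          # item disappears
            flips[item] += 1       # one positive-to-zero change

        distinct_over_time.append(distinct)

    max_flippancy = max(flips.values(), default=0)
    return distinct_over_time, max_flippancy


# Hypothetical example stream: "a" is emptied twice, "b" once.
example = [("a", 1), ("b", 1), ("a", -1), ("a", 1), ("a", -1), ("b", -1)]
series, w = distinct_counts_and_max_flippancy(example)
print(series, w)
```

In this example the maximum flippancy is determined by item `"a"`, whose count returns to zero twice. An insertion-only stream has maximum flippancy 0 under this definition, which matches the regime where polylogarithmic error is already known to be achievable.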