We design, to the best of our knowledge, the first differentially private (DP) stream aggregation processing system at scale. Our system, Differential Privacy SQL Pipelines (DP-SQLP), is built using a streaming framework similar to Spark Streaming, on top of the Spanner database and the F1 query engine from Google. Towards designing DP-SQLP we make both algorithmic and systems advances: we (i) design a novel (user-level) DP key selection algorithm that can operate on an unbounded set of possible keys and can scale to one billion keys that users have contributed, (ii) design a preemptive execution scheme for DP key selection that avoids enumerating all the keys at each triggering time, and (iii) use algorithmic techniques from DP continual observation to release a continual DP histogram of user contributions to different keys over the stream length. We empirically demonstrate the system's efficacy by obtaining at least a $16\times$ reduction in error over the meaningful baselines we consider. Using DP-SQLP, we implemented streaming differentially private user impressions for Google Shopping. The streaming DP algorithms are further applied to Google Trends.
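To make item (iii) concrete, the following is a minimal sketch of the classical binary tree mechanism from the DP continual observation literature, applied to a single counter. It is an illustration under simplifying assumptions, not the DP-SQLP implementation: it handles one key rather than a histogram, it gives event-level rather than user-level privacy, and it assumes each stream element is clipped to $[0, 1]$. The names `BinaryTreeCounter`, `laplace`, `horizon`, and `noise_fn` are hypothetical and chosen only for this example.

```python
import random


def laplace(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


class BinaryTreeCounter:
    """Releases a noisy running count after every stream element.

    Each element falls into at most `levels` partial sums (one per tree
    level), so adding Laplace(levels / epsilon) noise to every partial sum
    makes the whole output sequence epsilon-DP at the event level.
    """

    def __init__(self, horizon, epsilon, noise_fn=None, seed=None):
        if horizon < 1 or epsilon <= 0:
            raise ValueError("horizon must be >= 1 and epsilon must be > 0")
        self.horizon = horizon
        self.levels = horizon.bit_length()
        scale = self.levels / epsilon
        rng = random.Random(seed)
        # noise_fn is injectable so the logic can be checked without noise.
        self.noise_fn = noise_fn or (lambda: laplace(scale, rng))
        self.exact = [0.0] * self.levels  # exact partial sums per level
        self.noisy = [0.0] * self.levels  # their noisy counterparts
        self.t = 0

    def update(self, x):
        """Consume one element and return the noisy prefix sum so far."""
        if self.t >= self.horizon:
            raise ValueError("stream is longer than the declared horizon")
        x = min(1.0, max(0.0, float(x)))  # clip to bound sensitivity
        self.t += 1
        t = self.t
        i = (t & -t).bit_length() - 1  # index of the lowest set bit of t
        # The level-i partial sum absorbs all lower levels plus the new item.
        self.exact[i] = sum(self.exact[:i]) + x
        for j in range(i):
            self.exact[j] = 0.0
            self.noisy[j] = 0.0
        self.noisy[i] = self.exact[i] + self.noise_fn()
        # The prefix sum at time t is the sum of the partial sums that
        # correspond to the set bits of t.
        return sum(self.noisy[j] for j in range(self.levels) if (t >> j) & 1)


stream = [1, 0, 1, 1, 0, 1, 1, 1]
counter = BinaryTreeCounter(horizon=len(stream), epsilon=1.0, seed=0)
noisy_counts = [counter.update(x) for x in stream]
```

In this sketch each released count sums at most `levels` noisy partial sums, so the error grows polylogarithmically in the stream length rather than linearly, which is the property that makes continual release practical.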