We design, to the best of our knowledge, the first differentially private (DP) stream aggregation processing system at scale. Our system, Differential Privacy SQL Pipelines (DP-SQLP), uses a streaming framework similar to Spark Streaming and runs on top of the Spanner database and the F1 query engine from Google. In designing DP-SQLP we make both algorithmic and systems advances: we (i) design a novel user-level DP key selection algorithm that operates on an unbounded set of possible keys and scales to one billion keys contributed by users, (ii) design a preemptive execution scheme for DP key selection that avoids enumerating all keys at each triggering time, and (iii) use algorithmic techniques from DP continual observation to release a continual DP histogram of user contributions to different keys over the length of the stream. Empirically, DP-SQLP achieves at least a $16\times$ reduction in error relative to the meaningful baselines we consider. We implemented streaming differentially private user impressions for Google Shopping with DP-SQLP. The streaming DP algorithms are further applied to Google Trends.
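To make the idea of DP continual observation in item (iii) concrete, the following is a minimal sketch of the classic binary-tree counting mechanism for a single key. It is not the DP-SQLP implementation and is not taken from the paper: the class name `BinaryTreeCounter`, its parameters, and the assumption that every stream item lies in $[0, 1]$ are all illustrative. The guarantee it aims for is event-level $\varepsilon$-DP (neighboring streams differ in one item), which is weaker than the user-level guarantee the abstract describes, and the floating-point Laplace sampler is for illustration only.

```python
import random


class BinaryTreeCounter:
    """Noisy running sum over a stream of at most `horizon` items (sketch)."""

    def __init__(self, horizon, epsilon, seed=None):
        if horizon < 1 or epsilon <= 0:
            raise ValueError("horizon must be >= 1 and epsilon > 0")
        self.horizon = horizon
        # Number of dyadic levels; each item falls in at most one node per level.
        self.levels = horizon.bit_length()
        # Laplace scale: (L1 sensitivity of the whole tree) / epsilon.
        self.scale = self.levels / epsilon
        self.rng = random.Random(seed)
        self.t = 0
        self.exact = [0.0] * self.levels  # exact partial sums per level
        self.noisy = [0.0] * self.levels  # noisy partial sums per level

    def _laplace(self):
        # Difference of two i.i.d. exponentials is Laplace with the same scale.
        rate = 1.0 / self.scale
        return self.rng.expovariate(rate) - self.rng.expovariate(rate)

    def update(self, x):
        """Consume one item x in [0, 1] and return the noisy prefix sum."""
        if not 0 <= x <= 1:
            raise ValueError("items are assumed to lie in [0, 1]")
        if self.t >= self.horizon:
            raise RuntimeError("stream longer than the declared horizon")
        self.t += 1
        t = self.t
        # Level of the dyadic node that closes at time t (lowest set bit of t).
        i = (t & -t).bit_length() - 1
        # Merge all lower-level nodes plus the new item into node i.
        self.exact[i] = sum(self.exact[:i]) + x
        for j in range(i):
            self.exact[j] = 0.0
            self.noisy[j] = 0.0
        self.noisy[i] = self.exact[i] + self._laplace()
        # The prefix [1, t] is covered by the nodes at the set bits of t.
        return sum(self.noisy[j] for j in range(self.levels) if (t >> j) & 1)
```

Each released prefix sum combines at most `levels` noisy nodes, so the error grows polylogarithmically in the stream length rather than linearly, as it would if fresh noise were added to every release under a split privacy budget. A continual histogram could run one such counter per selected key, under whatever contribution bounding the user-level guarantee requires.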