This work presents a highly optimized implementation of PAC-DB, a recent and promising database privacy model. We prove that our implementation, SIMD-PAC-DB, computes the same privatized answer with a single query, instead of the 128 stochastic executions against different 50% database sub-samples that the original PAC-DB requires. Our key insight is that each bit of a hashed primary key can be interpreted as membership in one such sub-sample. We present new algorithms for approximate computation of stochastic aggregates based on these hashes; thanks to their SIMD-friendliness, they run up to 40x faster than scalar equivalents. We release an open-source DuckDB community extension that includes a rewriter to PAC-privatize arbitrary SQL queries. Our experiments on TPC-H, ClickBench, and SQLStorm evaluate thousands of queries in terms of performance and utility, significantly advancing the ease of use and functionality of privacy-aware data systems in practice.
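The key insight, that each bit of a hashed primary key marks membership in one 50% sub-sample, can be illustrated with a minimal scalar sketch. It is not the SIMD-PAC-DB implementation: the hash (`mix64`, a splitmix64-style finalizer), the function names, and the use of 64 sub-samples from a single 64-bit hash are all assumptions made for illustration, and the sums here are exact rather than approximate. The sketch only shows how one pass over the data can yield an aggregate per sub-sample.

```python
MASK64 = (1 << 64) - 1


def mix64(x: int) -> int:
    """Stand-in 64-bit hash (splitmix64-style finalizer); an assumption, not the paper's hash."""
    x = (x + 0x9E3779B97F4A7C15) & MASK64
    x = ((x ^ (x >> 30)) * 0xBF58476D1CE4E5B9) & MASK64
    x = ((x ^ (x >> 27)) * 0x94D049BB133111EB) & MASK64
    return x ^ (x >> 31)


def subsample_sums(rows, n_samples=64):
    """Single pass: bit j of the hashed key decides whether a row is in sub-sample j."""
    sums = [0] * n_samples
    for key, value in rows:
        h = mix64(key)
        for j in range(n_samples):
            if (h >> j) & 1:
                sums[j] += value
    return sums


def subsample_sum_naive(rows, j):
    """Reference: one separate 'execution' restricted to sub-sample j."""
    return sum(value for key, value in rows if (mix64(key) >> j) & 1)


# Hypothetical data: (primary key, value to aggregate).
rows = [(k, 3 * k + 1) for k in range(1000)]
sums = subsample_sums(rows)
```

Each entry of `sums` equals what a separate query over the corresponding sub-sample would return, so the single pass replaces one execution per sub-sample. The inner loop over bits is the kind of per-lane work that a SIMD formulation would be expected to vectorize.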