This work presents a highly optimized implementation of PAC-DB, a recent and promising database privacy model. We prove that our implementation, SIMD-PAC-DB, computes the same privatized answer with a single query, instead of the 128 stochastic executions against different 50% database sub-samples that the original PAC-DB requires. Our key insight is that each bit of a hashed primary key can be interpreted as membership in one such sub-sample. Based on these hashes, we present new algorithms for approximate computation of stochastic aggregates; thanks to their SIMD-friendliness, they run up to 40x faster than scalar equivalents. We release an open-source DuckDB community extension that includes a rewriter that PAC-privatizes arbitrary SQL queries. Our experiments on TPC-H, ClickBench, and SQLStorm evaluate thousands of queries in terms of performance and utility. Together, these contributions significantly advance the ease of use and functionality of privacy-aware data systems in practice.
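The hash-bit insight can be illustrated with a minimal Python sketch. It is not the authors' implementation: the hash function (BLAKE2b truncated to 128 bits), the names `key_hash` and `subsample_sums`, and the toy data are all assumptions made for illustration. It uses a plain scalar loop rather than SIMD, and it omits the PAC noise-addition step entirely. It only shows how a single pass over the data can produce one SUM per sub-sample, where bit j of the hashed key decides whether a row belongs to sub-sample j.

```python
import hashlib

NUM_SAMPLES = 128  # one sub-sample per hash bit (assumed to match the 128 executions)


def key_hash(pk: int) -> int:
    """Hypothetical 128-bit hash of a primary key (BLAKE2b, 16-byte digest)."""
    digest = hashlib.blake2b(str(pk).encode(), digest_size=16).digest()
    return int.from_bytes(digest, "little")


def subsample_sums(rows):
    """Single pass over (pk, value) rows.

    sums[j] accumulates value for every row whose hashed key has bit j set,
    i.e. SUM(value) restricted to sub-sample j.
    """
    sums = [0] * NUM_SAMPLES
    for pk, value in rows:
        h = key_hash(pk)
        for j in range(NUM_SAMPLES):
            if (h >> j) & 1:  # bit j set: row is a member of sub-sample j
                sums[j] += value
    return sums


# Toy table: 1000 rows with small integer values (illustrative data only).
rows = [(pk, pk % 7) for pk in range(1000)]
sums = subsample_sums(rows)
```

Because each bit of a well-mixed hash is set for roughly half of the keys, each `sums[j]` behaves like the aggregate over a roughly 50% sub-sample. The inner loop over bits is the part that a SIMD implementation would presumably vectorize.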