This work presents a highly optimized implementation of PAC-DB, a recent and promising database privacy model. We prove that our SIMD-PAC-DB can compute the same privatized answer with a single query, instead of the 128 stochastic executions against different 50% database sub-samples that the original PAC-DB requires. Our key insight is that each bit of a hashed primary key can be interpreted as membership in one such sub-sample. We present new algorithms for approximate computation of stochastic aggregates based on these hashes, which, thanks to their SIMD-friendliness, run up to 40x faster than scalar equivalents. We release an open-source DuckDB community extension that includes a rewriter that PAC-privatizes arbitrary SQL queries. Our experiments on TPC-H, ClickBench, and SQLStorm evaluate thousands of queries in terms of performance and utility. Together, these contributions significantly advance the ease of use and functionality of privacy-aware data systems in practice.
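To make the key insight concrete, the following is a minimal scalar sketch in Python, not the extension's SIMD implementation. It assumes a 128-bit hash so that each bit corresponds to one of the 128 sub-samples mentioned above. The use of BLAKE2b and the names `key_hash` and `subsample_sums` are illustrative assumptions and do not come from the paper. Because each hash bit is set with probability of roughly one half, bit j selects roughly a 50% sub-sample, and a single pass over the data yields one SUM per sub-sample.

```python
import hashlib

NUM_SAMPLES = 128  # one sub-sample per hash bit (assumed to match the 128 executions)


def key_hash(primary_key: int) -> int:
    """Hash a primary key to a 128-bit integer (BLAKE2b is an arbitrary choice)."""
    digest = hashlib.blake2b(
        str(primary_key).encode(), digest_size=NUM_SAMPLES // 8
    ).digest()
    return int.from_bytes(digest, "little")


def subsample_sums(rows):
    """Single pass: sums[j] accumulates the values of rows whose hash has bit j set."""
    sums = [0] * NUM_SAMPLES
    for primary_key, value in rows:
        h = key_hash(primary_key)
        for j in range(NUM_SAMPLES):
            if (h >> j) & 1:  # row belongs to sub-sample j
                sums[j] += value
    return sums


# Hypothetical table of (primary_key, value) rows.
rows = [(pk, pk % 7) for pk in range(10_000)]
sums = subsample_sums(rows)
```

Each entry `sums[j]` equals what a separate query restricted to sub-sample j would return, which is why one scan can stand in for 128 executions. The inner loop over bits is the part a SIMD implementation would vectorize.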