DP-S4S: Accurate and Scalable Select-Join-Aggregate Query Processing with User-Level Differential Privacy

Answering Select-Join-Aggregate (SJA) queries with differential privacy (DP) is a fundamental problem with important applications in various domains. The current state-of-the-art methods ensure user-level DP (i.e., the adversary cannot infer the presence or absence of any given individual user with high confidence) and achieve instance-optimal accuracy on the query results. However, these solutions involve solving expensive optimization programs, which may incur prohibitive computational overhead for large databases. One promising direction for achieving scalability is sampling, which provides a tunable trade-off between result utility and computational cost. Applying sampling to differentially private SJA processing is challenging, however, for two reasons. First, it is unclear what to sample in order to achieve the best accuracy within a given computational budget. Second, prior solutions were not designed with sampling in mind, and their mathematical toolchains are not sampling-friendly. To our knowledge, the only known solution that applies sampling to private SJA processing is S&E, a recent proposal that (i) samples users and (ii) combines sampling directly with existing solutions to enforce DP. We show that both design choices are suboptimal; consequently, even with a relatively high sample rate, the error incurred by S&E can be 10x higher than that of the underlying DP mechanism without sampling. Motivated by this, we propose Differentially Private Sampling for Scale (DP-S4S), a novel mechanism that addresses the above challenges by (i) sampling aggregation units instead of users, and (ii) laying the mathematical foundation for SJA processing under Rényi DP (RDP), which composes more easily with sampling. Further, DP-S4S can answer both scalar and vector SJA queries. Extensive experiments on real data demonstrate that DP-S4S enables scalable SJA processing on large datasets under user-level DP, while maintaining high result utility.
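
To make concrete why RDP is convenient to compose, the following minimal sketch uses only standard textbook facts about RDP: the Gaussian mechanism with L2 sensitivity Delta and noise scale sigma satisfies (alpha, alpha * Delta^2 / (2 * sigma^2))-RDP, RDP guarantees at the same order add up under composition, and an (alpha, eps)-RDP guarantee implies (eps + log(1/delta) / (alpha - 1), delta)-DP. This is not the DP-S4S mechanism; the function names, the candidate orders, and the parameter values are illustrative assumptions, and privacy amplification by sampling is deliberately left out.

```python
import math


def gaussian_rdp(alpha: float, sensitivity: float, sigma: float) -> float:
    """RDP epsilon at order alpha for the Gaussian mechanism (standard result)."""
    return alpha * sensitivity ** 2 / (2.0 * sigma ** 2)


def compose_rdp(rdp_epsilons: list[float]) -> float:
    """RDP guarantees at the same order compose by simple addition."""
    return sum(rdp_epsilons)


def rdp_to_dp(alpha: float, rdp_epsilon: float, delta: float) -> float:
    """Convert an (alpha, rdp_epsilon)-RDP guarantee to (epsilon, delta)-DP."""
    return rdp_epsilon + math.log(1.0 / delta) / (alpha - 1.0)


def best_dp_epsilon(sensitivity: float, sigma: float, releases: int,
                    delta: float, orders: list[float]) -> float:
    """Smallest (epsilon, delta)-DP epsilon over a set of candidate orders.

    Assumes `releases` independent Gaussian releases with identical
    sensitivity and noise scale (an illustrative setting only).
    """
    candidates = []
    for alpha in orders:
        per_release = gaussian_rdp(alpha, sensitivity, sigma)
        total = compose_rdp([per_release] * releases)
        candidates.append(rdp_to_dp(alpha, total, delta))
    return min(candidates)


if __name__ == "__main__":
    # Hypothetical parameters chosen purely for illustration.
    orders = [1.5, 2.0, 4.0, 8.0, 16.0, 32.0, 64.0]
    for k in (1, 10, 100):
        eps = best_dp_epsilon(sensitivity=1.0, sigma=20.0, releases=k,
                              delta=1e-6, orders=orders)
        print(f"{k:>3} releases: epsilon = {eps:.3f}")
```

Because the accounting is a sum over per-release RDP terms followed by a single conversion, a bound on the RDP cost of a sampled release can be plugged into the same accounting in place of `gaussian_rdp`, which is one common reason RDP is paired with sampling.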
