DP-S4S: Accurate and Scalable Select-Join-Aggregate Query Processing with User-Level Differential Privacy

Answering Select-Join-Aggregate (SJA) queries under differential privacy (DP) is a fundamental problem with important applications in various domains. The current state-of-the-art methods ensure user-level DP (i.e., the adversary cannot infer the presence or absence of any given individual user with high confidence) and achieve instance-optimal accuracy on the query results. However, these solutions involve solving expensive optimization programs, which may incur prohibitive computational overhead on large databases. One promising direction for achieving scalability is sampling, which provides a tunable trade-off between result utility and computational cost. Applying sampling to differentially private SJA processing is challenging, however, for two reasons. First, it is unclear what to sample in order to achieve the best accuracy within a given computational budget. Second, prior solutions were not designed with sampling in mind, and their mathematical toolchains are not sampling-friendly. To our knowledge, the only solution that applies sampling to private SJA processing is S&E, a recent proposal that (i) samples users and (ii) combines sampling directly with existing solutions to enforce DP. We show that both are suboptimal design choices; consequently, even with a relatively high sample rate, the error incurred by S&E can be 10x higher than that of the underlying DP mechanism without sampling. Motivated by this, we propose Differentially Private Sampling for Scale (DP-S4S), a novel mechanism that addresses the above challenges by (i) sampling aggregation units instead of users, and (ii) laying the mathematical foundation for SJA processing under Rényi differential privacy (RDP), which composes more easily with sampling. Further, DP-S4S can answer both scalar and vector SJA queries. Extensive experiments on real data demonstrate that DP-S4S enables scalable SJA processing on large datasets under user-level DP while maintaining high result utility.
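The trade-off between utility and computational cost that sampling offers can be illustrated with a minimal sketch. The code below is not DP-S4S or S&E: it adds no noise and provides no privacy guarantee. It only shows, under the assumption of independent Bernoulli sampling with inverse-rate scaling, how a lower sample rate processes fewer rows at the price of a noisier estimate. The function `sampled_sum` and the `contributions` list are hypothetical names introduced for illustration.

```python
import random


def sampled_sum(values, rate, seed=0):
    """Estimate sum(values) from a Bernoulli sample, scaled by 1/rate.

    Illustrative only: no noise is added, so this is NOT differentially private.
    Returns the estimate and the number of rows actually processed.
    """
    if not 0.0 < rate <= 1.0:
        raise ValueError("rate must be in (0, 1]")
    rng = random.Random(seed)
    # Keep each row independently with probability `rate`.
    kept = [v for v in values if rng.random() < rate]
    # Inverse-rate scaling makes the estimate unbiased for the true sum.
    return sum(kept) / rate, len(kept)


# Hypothetical per-row contributions to an aggregate (e.g., join results).
contributions = [1.0] * 10_000
for rate in (1.0, 0.5, 0.1):
    estimate, rows_processed = sampled_sum(contributions, rate)
    print(f"rate={rate}: estimate={estimate:.0f}, rows processed={rows_processed}")
```

With `rate=1.0` every row is kept and the estimate equals the exact sum; smaller rates reduce the number of rows processed while the estimate fluctuates around the true value.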
