In many real-world scenarios, multiple data providers need to collaboratively analyze their private data. The challenges in these applications, especially at big-data scale, are time and resource efficiency together with end-to-end privacy at minimal loss of accuracy. Existing approaches rely primarily on cryptography, which improves privacy at the expense of query response time. Big data analytics frameworks, however, require fast and accurate responses to large-scale queries, which makes cryptography-based solutions less suitable. In this work, we address the problem of combining Approximate Query Processing (AQP) and Differential Privacy (DP) in a private federated environment that answers range queries on horizontally partitioned multidimensional data. We propose a new approach that uses a data distribution-aware online sampling technique to accelerate the execution of range queries while ensuring end-to-end data privacy during and after analysis with minimal loss in accuracy. Through empirical evaluation, we show that our solution can provide up to 8 times faster processing than the basic non-secure solution while maintaining accuracy, formal privacy guarantees, and resilience to learning-based attacks.
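To make the combination of AQP and DP concrete, the following is a minimal sketch of a federated range count, not the method proposed in this work. It assumes uniform Bernoulli sampling as a stand-in for the data distribution-aware online sampling described above, and it assumes each provider perturbs its own sampled count with Laplace noise before sharing it. All function and parameter names (`in_range`, `private_sampled_count`, `federated_count`, `rate`, `epsilon`) are hypothetical.

```python
import random

def in_range(row, lo, hi):
    """Return True if every coordinate of row lies in [lo, hi] (inclusive)."""
    return all(l <= x <= h for x, l, h in zip(row, lo, hi))

def private_sampled_count(rows, lo, hi, rate, epsilon, rng):
    """Estimate one provider's range count from a sample, with Laplace noise.

    Assumption: uniform Bernoulli sampling at probability `rate`, used here
    only as a simple placeholder for a distribution-aware sampler.
    """
    # Count only the rows that are both sampled and inside the query range.
    sample_count = sum(
        1 for row in rows if rng.random() < rate and in_range(row, lo, hi)
    )
    # The difference of two Exp(epsilon) draws is Laplace with scale 1/epsilon.
    # Adding or removing one row changes sample_count by at most 1.
    noise = rng.expovariate(epsilon) - rng.expovariate(epsilon)
    # Scaling by 1/rate is post-processing of the noisy count.
    return (sample_count + noise) / rate

def federated_count(providers, lo, hi, rate, epsilon, seed=0):
    """Sum the noisy local estimates over horizontally partitioned providers."""
    rng = random.Random(seed)
    return sum(
        private_sampled_count(rows, lo, hi, rate, epsilon, rng)
        for rows in providers
    )

if __name__ == "__main__":
    providers = [
        [(1, 1), (5, 5), (7, 2)],
        [(2, 3), (9, 9), (4, 4)],
    ]
    print(federated_count(providers, lo=(0, 0), hi=(5, 5), rate=0.5, epsilon=1.0))
```

In this sketch, a lower `rate` reduces the number of rows each provider scans but increases the variance of the estimate, and a smaller `epsilon` strengthens the privacy guarantee at the cost of more noise. Because the providers hold disjoint rows, a single query touches each row's noise mechanism at most once.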