Benchmarking is crucial for evaluating a DBMS, yet existing benchmarks often fail to reflect the varied nature of user workloads. As a result, there is increasing momentum toward building benchmark databases from real-world user data so that they mirror business environments more accurately. However, privacy concerns deter users from sharing their data directly, which makes it important to synthesize benchmark databases that also protect privacy. Differential privacy has become a key method for safeguarding privacy when sharing data, but the focus has largely been on minimizing errors in aggregate queries or classification tasks, with less attention given to benchmarking factors such as runtime performance. This paper studies the creation of privacy-preserving databases specifically for benchmarking, aiming to produce a differentially private database whose query performance closely resembles that of the original data. We introduce PrivBench, a synthesis framework that supports the generation of high-quality data while maintaining privacy. PrivBench uses sum-product networks (SPNs) to partition and sample data, enhancing data representation while securing privacy. The framework lets users adjust the granularity of SPN partitions and the privacy settings, which is crucial for customizing privacy levels. Our approach uses the Laplace and exponential mechanisms, and we validate that it maintains privacy. Our experiments show that PrivBench generates data that maintains privacy and excels in query performance, consistently reducing errors in query execution time, query cardinality, and KL divergence.
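For readers unfamiliar with the Laplace mechanism named in the abstract, the following is a minimal sketch of how it can privatize a single count. It is not PrivBench's implementation: the function names `laplace_noise` and `private_count`, the choice of a counting query, and the sensitivity of 1 are assumptions made purely for illustration.

```python
import random


def laplace_noise(scale, rng):
    """Draw one sample from Laplace(0, scale).

    The difference of two independent Exp(1) variables follows Laplace(0, 1).
    """
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))


def private_count(values, predicate, epsilon, rng=None):
    """Return an epsilon-differentially-private estimate of a count.

    Hypothetical helper for illustration only; not part of PrivBench.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    rng = rng or random.Random()
    true_count = sum(1 for v in values if predicate(v))
    # Assumption: adding or removing one row changes a count by at most 1.
    sensitivity = 1.0
    return true_count + laplace_noise(sensitivity / epsilon, rng)


if __name__ == "__main__":
    ages = [23, 35, 41, 29, 52, 38, 61, 19]
    noisy = private_count(ages, lambda a: a >= 30, epsilon=0.5)
    print(f"noisy count of ages >= 30: {noisy:.2f}")
```

A smaller `epsilon` gives a larger noise scale and therefore stronger privacy at the cost of accuracy, which is the kind of trade-off a user-adjustable privacy setting exposes.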