Recent data search platforms search large dataset corpora using ML task-based utility measures rather than metadata-based keywords. Requesters submit a training dataset, and these platforms search for augmentations (join- or union-compatible datasets) that, when used to augment the requester's dataset, most improve the performance of a model (e.g., linear regression). Although this approach is effective, providers that manage personally identifiable data demand differential privacy (DP) guarantees before granting these platforms data access. Unfortunately, making data search differentially private is nontrivial, as a single search can involve training and evaluating models over candidate datasets hundreds or thousands of times, quickly depleting privacy budgets. We present Saibot, a differentially private data search platform that employs the Factorized Privacy Mechanism (FPM), a novel DP mechanism, to calculate sufficient semi-ring statistics for ML over different combinations of datasets. These statistics are privatized once and can then be freely reused during the search. This allows Saibot to scale to arbitrary numbers of datasets and requests while minimizing the effect of DP noise on search results. We optimize the sensitivity of FPM for common augmentation operations and analyze its properties with respect to linear regression. Specifically, we develop an unbiased estimator for many-to-many joins, prove its bounds, and develop an optimization that redistributes DP noise to minimize its impact on the model. Our evaluation on a real-world corpus of 329 datasets demonstrates that Saibot can return augmentations that achieve model accuracy within 50% to 90% of non-private search, while the leading alternative DP mechanisms (TPM, APM, shuffling) are several orders of magnitude worse.
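To make the "privatize once, reuse freely" idea concrete, the following is a minimal sketch and not Saibot's FPM itself. It assumes plain Gaussian noise added to the Gram matrix of a dataset (count, sums, and pairwise products), with the noise scale `sigma` calibrated elsewhere to the desired privacy guarantee. The function names (`gram_statistics`, `union_statistics`, `privatize_once`, `fit_from_statistics`) are hypothetical and chosen for illustration only. Joins, the unbiased many-to-many estimator, and the noise redistribution described above are not covered.

```python
import numpy as np

def gram_statistics(Z):
    """Sufficient statistics for linear regression over the columns of Z.

    A constant column is prepended, so entry [0, 0] is the row count,
    row 0 holds the column sums, and the rest holds pairwise products.
    """
    Z1 = np.hstack([np.ones((Z.shape[0], 1)), Z])
    return Z1.T @ Z1

def union_statistics(stats_a, stats_b):
    """Statistics of the union of two datasets: element-wise addition."""
    return stats_a + stats_b

def privatize_once(stats, sigma, rng):
    """Add symmetric Gaussian noise a single time.

    Assumption: sigma was calibrated by the caller to the sensitivity of
    the statistics and the target privacy parameters.
    """
    noise = rng.normal(0.0, sigma, size=stats.shape)
    noise = np.triu(noise) + np.triu(noise, 1).T  # mirror the upper triangle
    return stats + noise

def fit_from_statistics(stats, feature_idx, target_idx, ridge=1e-3):
    """Solve the normal equations using only the (possibly noisy) statistics.

    Indices refer to the augmented matrix: 0 is the constant column and
    column j of the original data sits at index j + 1.
    """
    idx = [0] + list(feature_idx)
    A = stats[np.ix_(idx, idx)]
    b = stats[idx, target_idx]
    return np.linalg.solve(A + ridge * np.eye(len(idx)), b)

# Usage: privatize once, then evaluate several candidate feature sets
# without touching the raw data again.
rng = np.random.default_rng(42)
X = rng.uniform(-1.0, 1.0, size=(1000, 3))
y = 0.5 + X @ np.array([1.0, -2.0, 0.5]) + rng.normal(0.0, 0.1, size=1000)
noisy_stats = privatize_once(gram_statistics(np.column_stack([X, y])), 1.0, rng)
for features in ([1], [1, 2], [1, 2, 3]):
    weights = fit_from_statistics(noisy_stats, features, target_idx=4)
    print(features, weights)
```

Because every model in the loop is fitted from the same noisy matrix, the repeated fits are post-processing of a single release and consume no additional privacy budget. This reuse is the property the abstract relies on.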