When analysing Differentially Private (DP) machine learning pipelines, the potential privacy cost of data-dependent pre-processing is frequently overlooked in privacy accounting. In this work, we propose a general framework to evaluate the additional privacy cost incurred by non-private data-dependent pre-processing algorithms. Our framework establishes upper bounds on the overall privacy guarantees by utilising two new technical notions: a variant of DP termed Smooth DP and the bounded sensitivity of the pre-processing algorithms. In addition to the general framework, we provide explicit overall privacy guarantees for multiple data-dependent pre-processing algorithms, such as data imputation, quantization, deduplication, and PCA, when used in combination with several DP algorithms. Notably, this framework is simple to implement, allowing direct integration into existing DP pipelines.
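To make the accounting gap concrete, the sketch below shows a naive pipeline of the kind the abstract describes: a non-private, data-dependent deduplication step followed by a DP mean computed with the Laplace mechanism. This is a hypothetical illustration, not the paper's method: the function name, parameters, and the choice of deduplication plus Laplace mean are all our own assumptions. The key point, marked in the comments, is that the noise scale is calibrated only to the post-deduplication dataset, so the privacy cost of the data-dependent pre-processing step goes unaccounted.

```python
import numpy as np

def naive_dp_mean_with_dedup(data, epsilon, lo=0.0, hi=1.0, rng=None):
    """Hypothetical pipeline: non-private deduplication, then a DP mean.

    The Laplace scale below is calibrated as if the deduplicated dataset
    were the true input. Deduplication is data-dependent (adding or removing
    one record can change which records survive), so this naive accounting
    ignores the extra privacy cost that the paper's framework bounds.
    """
    rng = np.random.default_rng() if rng is None else rng
    clipped = np.clip(np.asarray(data, dtype=float), lo, hi)
    unique = np.unique(clipped)            # non-private, data-dependent pre-processing
    n = max(len(unique), 1)
    sensitivity = (hi - lo) / n            # sensitivity of the mean on n clipped records
    noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon)
    return unique.mean() + noise
```

With a very large `epsilon`, the Laplace noise is negligible, so the output approaches the mean of the deduplicated values; the framework in this work would instead report a (weaker) end-to-end guarantee that covers the deduplication step as well.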