When analysing Differentially Private (DP) machine learning pipelines, the potential privacy cost of data-dependent pre-processing is frequently overlooked in privacy accounting. In this work, we propose a general framework to evaluate the additional privacy cost incurred by non-private data-dependent pre-processing algorithms. Our framework establishes upper bounds on the overall privacy guarantees by utilising two new technical notions: a variant of DP termed Smooth DP and the bounded sensitivity of the pre-processing algorithms. In addition to the generic framework, we provide explicit overall privacy guarantees for multiple data-dependent pre-processing algorithms, such as data imputation, quantization, deduplication and PCA, when used in combination with several DP algorithms. Notably, this framework is also simple to implement, allowing direct integration into existing DP pipelines.
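As an illustration of why data-dependent pre-processing can carry a privacy cost, consider mean imputation. The sketch below is not the proposed framework; it is a minimal toy example, and the helper names `impute_with_mean` and `hamming_distance` are hypothetical. It shows that two datasets differing in a single record can differ in several records once missing values are filled with the data-dependent mean, so a guarantee stated for datasets differing in one record no longer applies directly to the pre-processed inputs.

```python
from statistics import mean


def impute_with_mean(column):
    """Replace missing entries (None) with the mean of the observed entries."""
    observed = [x for x in column if x is not None]
    fill = mean(observed)
    return [fill if x is None else x for x in column]


def hamming_distance(a, b):
    """Count the positions at which two equal-length lists differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


# Two neighbouring datasets: they differ only in the record at index 0.
d1 = [1.0, 2.0, 3.0, None, None, None]
d2 = [7.0, 2.0, 3.0, None, None, None]

# Non-private, data-dependent pre-processing applied to each dataset.
p1 = impute_with_mean(d1)
p2 = impute_with_mean(d2)

print("records differing before imputation:", hamming_distance(d1, d2))
print("records differing after imputation: ", hamming_distance(p1, p2))
```

In this toy case one changed record shifts the imputed value, and therefore every imputed entry, which is the kind of effect that a sensitivity bound on the pre-processing step is meant to capture.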