Privacy accounting for Differentially Private (DP) machine learning pipelines frequently overlooks the potential privacy cost of data-dependent pre-processing. In this work, we propose a general framework to evaluate the additional privacy cost incurred by non-private data-dependent pre-processing algorithms. Our framework establishes upper bounds on the overall privacy guarantees by utilising two new technical notions: a variant of DP termed Smooth DP and the bounded sensitivity of the pre-processing algorithms. In addition to the generic framework, we provide explicit overall privacy guarantees for multiple data-dependent pre-processing algorithms, such as data imputation, quantization, deduplication and PCA, when used in combination with several DP algorithms. Notably, this framework is also simple to implement, allowing direct integration into existing DP pipelines.
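To make the concern concrete, the following minimal sketch illustrates why data-dependent pre-processing can matter for privacy accounting. It is an illustrative toy, not the framework proposed here: the helper names `mean_impute` and `num_differing_records` and the example datasets are assumptions introduced for this illustration. Two datasets that differ in a single record can differ in several records after mean imputation, so a guarantee stated for neighbouring inputs no longer applies directly to the pre-processed data.

```python
from typing import List, Optional


def mean_impute(data: List[Optional[float]]) -> List[float]:
    """Replace missing entries (None) with the mean of the observed entries.

    The fill value depends on the whole dataset, which is what makes
    this pre-processing step data-dependent.
    """
    observed = [x for x in data if x is not None]
    mean = sum(observed) / len(observed)
    return [mean if x is None else x for x in data]


def num_differing_records(a: list, b: list) -> int:
    """Count the positions at which two equal-length datasets differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


# Hypothetical neighbouring datasets: they differ only in the last record.
d1 = [1.0, None, None, None, 3.0]
d2 = [1.0, None, None, None, 7.0]

raw_distance = num_differing_records(d1, d2)  # 1

p1 = mean_impute(d1)  # [1.0, 2.0, 2.0, 2.0, 3.0]
p2 = mean_impute(d2)  # [1.0, 4.0, 4.0, 4.0, 7.0]

processed_distance = num_differing_records(p1, p2)  # 4

print(f"records differing before imputation: {raw_distance}")
print(f"records differing after imputation:  {processed_distance}")
```

In this toy case a single changed record alters four records after imputation. How far the pre-processed outputs of neighbouring datasets can drift apart is the kind of quantity that a bounded-sensitivity notion for pre-processing is meant to control; the sketch only shows the effect and does not compute any privacy guarantee.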