When analysing Differentially Private (DP) machine learning pipelines, the potential privacy cost of data-dependent pre-processing is frequently overlooked in privacy accounting. In this work, we propose a general framework to evaluate the additional privacy cost incurred by non-private data-dependent pre-processing algorithms. Our framework establishes upper bounds on the overall privacy guarantees by utilising two new technical notions: a variant of DP termed Smooth DP, and the bounded sensitivity of the pre-processing algorithms. In addition to the generic framework, we provide explicit overall privacy guarantees for multiple data-dependent pre-processing algorithms, such as data imputation, quantisation, deduplication and principal component analysis (PCA), when used in combination with several DP algorithms. Notably, this framework is also simple to implement, allowing direct integration into existing DP pipelines.
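As an informal illustration of why data-dependent pre-processing matters for privacy accounting, the sketch below applies mean imputation to two neighbouring datasets that differ in a single record. It is not taken from the framework itself: the function names `impute_mean` and `hamming_distance` and the toy data are assumptions made for this example, and the Hamming distance is used only as a simple stand-in for how far apart the pre-processed outputs are.

```python
from statistics import mean


def impute_mean(column):
    """Replace missing entries (None) with the mean of the observed entries."""
    observed = [x for x in column if x is not None]
    fill = mean(observed)
    return [fill if x is None else x for x in column]


def hamming_distance(a, b):
    """Count the positions at which two equal-length lists differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


# Two neighbouring datasets (toy data): they differ only in the third record.
d1 = [1.0, 2.0, 3.0, None, None, None]
d2 = [1.0, 2.0, 9.0, None, None, None]

out1 = impute_mean(d1)
out2 = impute_mean(d2)

input_changes = hamming_distance(d1, d2)     # records changed before pre-processing
output_changes = hamming_distance(out1, out2)  # records changed after pre-processing

print("changed before imputation:", input_changes)
print("changed after imputation:", output_changes)
```

In this toy case a single changed record alters the imputed value of every missing entry, so the pre-processed datasets differ in more than one position. A DP guarantee that assumes neighbouring inputs differ in one record therefore no longer applies directly to the pre-processed data, which is the kind of gap the abstract describes.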