As data-privacy requirements become increasingly stringent and statistical models based on sensitive data are deployed and used more routinely, protecting data privacy becomes pivotal. Partial Least Squares (PLS) regression is the premier tool for building such models in analytical chemistry, yet it does not inherently provide privacy guarantees, leaving sensitive (training) data vulnerable to privacy attacks. To address this gap, we propose an $(\epsilon, \delta)$-differentially private PLS (edPLS) algorithm, which integrates well-studied and theoretically motivated Gaussian noise-adding mechanisms into the PLS algorithm to ensure the privacy of the data underlying the model. Our approach adds carefully calibrated Gaussian noise to the outputs of four key functions in the PLS algorithm: the weights, scores, $X$-loadings, and $Y$-loadings. The noise variance is determined by the global sensitivity of each function, ensuring that the privacy loss is controlled according to the $(\epsilon, \delta)$-differential privacy framework. Specifically, we derive sensitivity bounds for each function and use these bounds to calibrate the noise added to the model components. Experimental results demonstrate that edPLS renders ineffective privacy attacks aimed at recovering unique sources of variability in the training data. Application of edPLS to the NIR corn benchmark dataset shows that the root mean squared error of prediction (RMSEP) remains competitive even at strong privacy levels (i.e., $\epsilon=1$), given proper pre-processing of the corresponding spectra. These findings highlight the practical utility of edPLS for creating privacy-preserving multivariate calibrations and for analyzing their privacy-utility trade-offs.
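To make the mechanism concrete, the following is a minimal sketch of how Gaussian noise calibrated under $(\epsilon, \delta)$-differential privacy could be injected into a NIPALS-style PLS1 iteration. The sensitivity values passed in are placeholders: deriving the actual global-sensitivity bounds for the weights, scores, and loadings is the paper's contribution and is not reproduced here. The noise scale uses the classical analytic Gaussian-mechanism bound $\sigma \ge \Delta \sqrt{2\ln(1.25/\delta)}/\epsilon$ (valid for $\epsilon \le 1$); the function names are illustrative, not the authors' implementation.

```python
import numpy as np

def gaussian_mechanism(value, sensitivity, epsilon, delta, rng):
    """Add Gaussian noise calibrated to an (epsilon, delta)-DP guarantee.

    Uses the classical bound sigma = sensitivity * sqrt(2 ln(1.25/delta)) / epsilon,
    which holds for epsilon <= 1.
    """
    sigma = sensitivity * np.sqrt(2.0 * np.log(1.25 / delta)) / epsilon
    return value + rng.normal(0.0, sigma, size=np.shape(value))

def noisy_pls1(X, y, n_components, sensitivities, epsilon, delta, seed=0):
    """Illustrative NIPALS-style PLS1 with Gaussian noise added to the four
    perturbed quantities named in the abstract: weights (w), scores (t),
    X-loadings (p), and y-loadings (q).

    `sensitivities` maps {"w", "t", "p", "q"} to assumed global-sensitivity
    bounds; these are hypothetical stand-ins for the paper's derived bounds.
    Returns the regression coefficient vector b.
    """
    rng = np.random.default_rng(seed)
    E, f = X.copy(), y.copy()          # deflated data and response
    W, P, q = [], [], []
    for _ in range(n_components):
        w = E.T @ f                    # weight vector
        w /= np.linalg.norm(w)
        w = gaussian_mechanism(w, sensitivities["w"], epsilon, delta, rng)
        t = E @ w                      # score vector
        t = gaussian_mechanism(t, sensitivities["t"], epsilon, delta, rng)
        tt = t @ t
        p = E.T @ t / tt               # X-loading
        p = gaussian_mechanism(p, sensitivities["p"], epsilon, delta, rng)
        qk = f @ t / tt                # y-loading (scalar for PLS1)
        qk = gaussian_mechanism(qk, sensitivities["q"], epsilon, delta, rng)
        E = E - np.outer(t, p)         # deflate X
        f = f - qk * t                 # deflate y
        W.append(w); P.append(p); q.append(qk)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    # Standard PLS coefficients: b = W (P^T W)^{-1} q
    return W @ np.linalg.solve(P.T @ W, q)
```

With small sensitivity placeholders the coefficients stay close to non-private PLS; larger bounds (or smaller $\epsilon$) inject more noise, which is the privacy-utility trade-off the abstract describes.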