As machine learning techniques are widely applied in practice, privacy preservation has gained increasing attention. Protecting user privacy with minimal accuracy loss is a fundamental task in the data analysis and mining community. In this paper, we focus on regression tasks under $ε$-label differential privacy guarantees. Some existing methods for regression with $ε$-label differential privacy, such as the RR-On-Bins mechanism, discretize the output space into finitely many bins and then apply the randomized response (RR) algorithm. To determine these bins efficiently, the authors round the original responses down to integer values. However, such operations do not align well with real-world scenarios, where responses are typically continuous. To overcome this limitation, we model both the original and the randomized responses as continuous random variables, avoiding discretization entirely. Our approach estimates an optimal interval for the randomized responses and introduces new algorithms for the cases where a prior is known and where it is unknown. Additionally, we prove that our algorithm, RPWithPrior, guarantees $ε$-label differential privacy. Numerical results demonstrate that our approach outperforms the Gaussian, Laplace, Staircase, RR-On-Bins, and Unbiased mechanisms on the Communities and Crime, Criteo Sponsored Search Conversion Log, and California Housing datasets.
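To make the discretize-then-randomize baseline described above concrete, the following is a minimal sketch of $k$-ary randomized response applied to binned labels. It is not the RR-On-Bins mechanism itself and not our method: the bin edges are fixed by hand rather than chosen optimally, the bin midpoint is used as the released value, and the names `bin_index`, `k_ary_rr`, and `rr_on_bins_sketch` are hypothetical helpers introduced only for illustration.

```python
import math
import random


def bin_index(y, edges):
    """Index of the bin [edges[i], edges[i+1]) containing y.

    Values outside the range are clamped to the first or last bin.
    """
    k = len(edges) - 1
    for i in range(k):
        if y < edges[i + 1]:
            return i
    return k - 1


def k_ary_rr(true_idx, k, epsilon, rng):
    """Standard k-ary randomized response over k bins.

    The true bin is kept with probability e^eps / (e^eps + k - 1);
    otherwise one of the other k - 1 bins is chosen uniformly, so the
    ratio of output probabilities for any two inputs is at most e^eps.
    """
    p_keep = math.exp(epsilon) / (math.exp(epsilon) + k - 1)
    if rng.random() < p_keep:
        return true_idx
    other = rng.randrange(k - 1)  # uniform over the k - 1 remaining bins
    return other if other < true_idx else other + 1


def rr_on_bins_sketch(y, edges, epsilon, rng):
    """Discretize y into a bin, randomize the bin, release its midpoint.

    Assumption: fixed, hand-chosen edges and midpoint representatives.
    """
    k = len(edges) - 1
    idx = k_ary_rr(bin_index(y, edges), k, epsilon, rng)
    return 0.5 * (edges[idx] + edges[idx + 1])


if __name__ == "__main__":
    rng = random.Random(0)
    edges = [0.0, 1.0, 2.0, 3.0, 4.0]  # illustrative bins only
    print([rr_on_bins_sketch(2.3, edges, 1.0, rng) for _ in range(5)])
```

The released value can only be one of the $k$ bin midpoints, which illustrates the information lost to discretization that a continuous treatment of the responses is meant to avoid.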