We study the differential privacy (DP) of a core ML problem, linear ordinary least squares (OLS), a.k.a. $\ell_2$-regression. Our key result is that the approximate LS algorithm (ALS) (Sarlos, 2006), a randomized solution to the OLS problem primarily used to improve performance on large datasets, also preserves privacy. Without modification or further noising, ALS achieves a better privacy/utility tradeoff than alternative private OLS algorithms, which modify and/or add noise to OLS. We give the first {\em tight} DP analysis for the ALS algorithm and for the standard Gaussian mechanism (Dwork et al., 2014) applied to OLS. Our methodology directly improves the privacy analyses of (Blocki et al., 2012) and (Sheffet, 2019) and introduces new tools which may be of independent interest: (1) the exact spectrum of $(\epsilon, \delta)$-DP parameters (``DP spectrum'') for mechanisms whose output is a $d$-dimensional Gaussian, and (2) an improved DP spectrum for random projection (compared to (Blocki et al., 2012) and (Sheffet, 2019)). All methods for private OLS (including ours) assume, often implicitly, restrictions on the input database, such as bounds on leverage and residuals. We prove that such restrictions are necessary. Hence, computing the privacy of mechanisms such as ALS requires estimating these database parameters, which can be infeasible for big datasets. For more complex ML models, DP bounds may not even be tractable. There is therefore a need for blackbox DP-estimators (Lu et al., 2022), which empirically estimate data-dependent privacy. We demonstrate the effectiveness of such a DP-estimator by empirically recovering a DP spectrum that matches our theory for OLS. This validates the DP-estimator in a nontrivial ML application, opening the door to its use in more complex nonlinear ML settings where theory is unavailable.
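For readers unfamiliar with ALS, the following is a minimal sketch-and-solve illustration in Python, not the construction analyzed in the paper. It assumes a dense Gaussian random projection with i.i.d. $N(0, 1/r)$ entries; the function name `approximate_least_squares`, the sketch size `r`, and the synthetic data are hypothetical choices made only for illustration. The sketch computes no privacy parameters; it only shows that the randomized solution stays close to the exact OLS solution.

```python
import numpy as np


def approximate_least_squares(X, y, r, rng=None):
    """Sketch-and-solve least squares with a Gaussian random projection.

    The n rows of (X, y) are compressed to r rows by a random matrix S,
    and the smaller problem min_w ||S X w - S y||_2 is solved exactly.
    """
    rng = np.random.default_rng(rng)
    n = X.shape[0]
    # Assumption: i.i.d. N(0, 1/r) entries, so that E[S^T S] = I_n.
    S = rng.normal(scale=1.0 / np.sqrt(r), size=(r, n))
    w, *_ = np.linalg.lstsq(S @ X, S @ y, rcond=None)
    return w


# Synthetic data (hypothetical): n samples, d features, small label noise.
data_rng = np.random.default_rng(0)
n, d, r = 2000, 5, 400
X = data_rng.normal(size=(n, d))
w_true = np.arange(1.0, d + 1.0)
y = X @ w_true + 0.1 * data_rng.normal(size=n)

# Exact OLS solution versus the randomized ALS solution.
w_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
w_als = approximate_least_squares(X, y, r, rng=1)
print("distance between ALS and OLS solutions:", np.linalg.norm(w_als - w_ols))
```

The output is random because `S` is redrawn on every call. That randomness is the only noise source in this illustration; nothing is added to the data or to the solution.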