We introduce a new differentially private regression setting we call Private Regression in Multiple Outcomes (PRIMO), inspired by the common situation where a data analyst wants to perform a set of $l$ regressions while preserving privacy, where the covariates $X$ are shared across all $l$ regressions, and each regression $i \in [l]$ has a different vector of outcomes $y_i$. While naively applying private linear regression techniques $l$ times leads to a $\sqrt{l}$ multiplicative increase in error over the standard linear regression setting, in Subsection $4.1$ we modify techniques based on sufficient statistics perturbation (SSP) to yield greatly improved dependence on $l$. In Subsection $4.2$ we prove an equivalence to the problem of privately releasing the answers to a special class of low-sensitivity queries we call inner product queries. Via this equivalence, we adapt the geometric projection-based methods from prior work on private query release to the PRIMO setting. Under the assumption that the labels $Y$ are public, the projection gives improved results over the Gaussian mechanism when $n < l\sqrt{d}$, with no asymptotic dependence on $l$ in the error. In Subsection $4.3$ we study the complexity of our projection algorithm, and in Subsection $4.4$ we analyze a faster sub-sampling-based variant. Finally, in Section $5$ we apply our algorithms to the task of private genomic risk prediction for multiple phenotypes using data from the 1000 Genomes Project. We find that for moderately large values of $l$ our techniques drastically improve the accuracy relative to both the naive baseline that uses existing private regression methods and our modified SSP algorithm that does not use the projection.
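To make the shared-covariate structure concrete, the following is a minimal sketch of sufficient statistics perturbation applied to $l$ outcomes at once. It is an illustration under stated assumptions, not the algorithm of Subsection $4.1$: the function name `ssp_multi_outcome`, the noise scales `noise_cov` and `noise_xty`, and the `ridge` term are all hypothetical, and calibrating the noise to a target privacy guarantee is deliberately left out. The point it shows is that $X^\top X$ is common to every regression and so only needs to be perturbed once, while $X^\top Y$ carries one column per outcome.

```python
import numpy as np

def ssp_multi_outcome(X, Y, noise_cov, noise_xty, ridge=1e-3, rng=None):
    """Sketch of SSP for l regressions that share the covariates X.

    X: (n, d) covariates; Y: (n, l) outcomes, one column per regression.
    noise_cov, noise_xty: Gaussian noise standard deviations (hypothetical
    parameters; choosing them to satisfy a privacy budget is not shown).
    Returns a (d, l) matrix whose i-th column estimates regression i.
    """
    rng = np.random.default_rng() if rng is None else rng
    d = X.shape[1]
    l = Y.shape[1]

    # X^T X is shared by all l regressions, so it is perturbed only once.
    E = rng.normal(0.0, noise_cov, size=(d, d))
    E = np.triu(E) + np.triu(E, 1).T  # symmetrize the noise matrix
    cov_noisy = X.T @ X + E + ridge * np.eye(d)

    # X^T Y has one d-dimensional column per outcome; each gets its own noise.
    xty_noisy = X.T @ Y + rng.normal(0.0, noise_xty, size=(d, l))

    # Solve all l noisy normal equations in a single call.
    return np.linalg.solve(cov_noisy, xty_noisy)
```

With both noise scales and `ridge` set to zero the function reduces to ordinary least squares on every column of `Y`, which is a convenient sanity check before adding noise.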