Private data analysis faces a significant challenge known as the curse of dimensionality, leading to increased costs. However, many datasets possess an inherent low-dimensional structure. For instance, during optimization via gradient descent, the gradients frequently reside near a low-dimensional subspace. If the low-dimensional structure could be privately identified using a small number of points, we could avoid paying (in terms of privacy and accuracy) for the high ambient dimension. On the negative side, Dwork, Talwar, Thakurta, and Zhang (STOC 2014) proved that privately estimating subspaces, in general, requires a number of points that depends on the dimension. But Singhal and Steinke (NeurIPS 2021) bypassed this limitation by considering points that are i.i.d. samples from a Gaussian distribution whose covariance matrix has a certain eigenvalue gap. Yet, it remained unclear whether similar upper bounds could be obtained without distributional assumptions, and whether lower bounds depending on similar eigenvalue gaps could be proven. In this work, we make progress in both directions. We formulate the problem of private subspace estimation under two different types of singular value gaps of the input data and prove new upper and lower bounds for both types. In particular, our results determine which type of gap is sufficient, and which is necessary, for estimating a subspace with a number of points that is independent of the dimension.