Improved Analysis of Sparse Linear Regression in Local Differential Privacy Model

In this paper, we revisit the problem of sparse linear regression in the local differential privacy (LDP) model. Existing research in the non-interactive and sequentially interactive local models has focused on obtaining lower bounds for the case where the underlying parameter is $1$-sparse, and extending such bounds to the more general $k$-sparse case has proven challenging. Moreover, it is unclear whether efficient non-interactive LDP (NLDP) algorithms exist. To address these issues, we first consider the problem in the $\epsilon$-NLDP model and provide a lower bound of $\Omega(\frac{\sqrt{dk\log d}}{\sqrt{n}\epsilon})$ on the $\ell_2$-norm estimation error for sub-Gaussian data, where $n$ is the sample size, $d$ is the dimension of the space, and $k$ is the sparsity of the underlying parameter. We then propose an NLDP algorithm, the first for this problem, which also yields a novel and efficient estimator as a by-product. Our algorithm achieves an upper bound of $\tilde{O}({\frac{d\sqrt{k}}{\sqrt{n}\epsilon}})$ on the estimation error when the data is sub-Gaussian, which can be further improved by a factor of $O(\sqrt{d})$ if the server has additional public but unlabeled data. For the sequentially interactive LDP model, we show a similar lower bound of $\Omega({\frac{\sqrt{dk}}{\sqrt{n}\epsilon}})$. As for the upper bound, we rectify a previous method and show that it is possible to achieve a bound of $\tilde{O}(\frac{k\sqrt{d}}{\sqrt{n}\epsilon})$. Our findings reveal fundamental differences among the non-private case, the central DP model, and the local DP model in the sparse linear regression problem.
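
To make the gaps between the stated upper and lower bounds explicit, the ratios can be worked out directly from the rates above. This is only a sketch: it takes the stated rates at face value, ignores the logarithmic factors hidden in $\tilde{O}(\cdot)$, and assumes the "factor of $O(\sqrt{d})$" improvement simply divides the non-interactive upper bound by $\sqrt{d}$.

$$
\begin{aligned}
\text{non-interactive:}\quad & \frac{d\sqrt{k}/(\sqrt{n}\epsilon)}{\sqrt{dk\log d}/(\sqrt{n}\epsilon)} = \sqrt{\frac{d}{\log d}}, \\
\text{non-interactive, with public unlabeled data:}\quad & \frac{d\sqrt{k}}{\sqrt{n}\epsilon}\cdot\frac{1}{\sqrt{d}} = \frac{\sqrt{dk}}{\sqrt{n}\epsilon}, \\
\text{sequentially interactive:}\quad & \frac{k\sqrt{d}/(\sqrt{n}\epsilon)}{\sqrt{dk}/(\sqrt{n}\epsilon)} = \sqrt{k}.
\end{aligned}
$$

Under these assumptions, the non-interactive upper bound exceeds the lower bound by roughly $\sqrt{d/\log d}$, the public-data variant matches the lower bound up to logarithmic factors, and the sequentially interactive bounds differ by a factor of $\sqrt{k}$.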
