Linear $L_1$-regularized models remain one of the simplest and most effective tools in data analysis, especially in information retrieval problems, where n-grams over text with TF-IDF or Okapi feature values are a strong and easy baseline. Over the past decade, screening rules have risen in popularity as a way to reduce the runtime of producing the sparse regression weights of $L_1$ models. However, despite the increasing need for privacy-preserving models in information retrieval, to the best of our knowledge, no differentially private screening rule exists. In this paper, we develop the first differentially private screening rule for linear and logistic regression. In doing so, we find that making a private screening rule useful is difficult because of the amount of noise that must be added to ensure privacy. We provide theoretical arguments and experimental evidence that this difficulty arises from the screening step itself and not from the private optimizer. Based on our results, we highlight that developing an effective private $L_1$ screening method is an open problem in the differential privacy literature.