Repeated use of a data sample via adaptively chosen queries can rapidly lead to overfitting, wherein the empirical evaluation of queries on the sample deviates significantly from their mean with respect to the underlying data distribution. It turns out that simple noise-addition algorithms suffice to prevent this issue, and differential privacy-based analysis of these algorithms shows that they can handle an asymptotically optimal number of queries. However, the worst-case nature of differential privacy entails either scaling the noise to the range of the queries, even for highly concentrated queries, or introducing more complex algorithms. In this paper, we prove that straightforward noise-addition algorithms already provide variance-dependent guarantees that also extend to unbounded queries. This improvement stems from a novel characterization that illuminates the core problem of adaptive data analysis. We show that the harm of adaptivity results from the covariance between the new query and a Bayes factor-based measure of how much information about the data sample was encoded in the responses given to past queries. We then leverage this characterization to introduce a new data-dependent stability notion that can bound this covariance.
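To make the overfitting phenomenon concrete, the following is a minimal simulation sketch, not the algorithm or analysis of this paper. It assumes a toy setting of our own choosing: each data point has `d` random ±1 features and an independent ±1 label, so every query used below has true mean zero. An analyst first asks for the correlation of each feature with the label, then adaptively builds one final query from the signs of the responses. The names `noisy_mean` and `adaptive_gap`, and the values of `n`, `d`, and `sigma`, are illustrative assumptions.

```python
import random


def sign(v):
    """Return +1 for non-negative values and -1 otherwise."""
    return 1 if v >= 0 else -1


def noisy_mean(sample, query, sigma, rng):
    """Empirical mean of `query` on `sample`, plus Gaussian noise of std `sigma`."""
    empirical = sum(query(x) for x in sample) / len(sample)
    return empirical + rng.gauss(0.0, sigma)


def adaptive_gap(sample, d, sigma, rng):
    """Empirical value of an adaptively built query whose true mean is 0."""
    # Round 1: d non-adaptive queries, each the product of feature j and the label.
    answers = [
        noisy_mean(sample, lambda x, j=j: x[0][j] * x[1], sigma, rng)
        for j in range(d)
    ]
    weights = [sign(a) for a in answers]

    # Round 2: one adaptive query that combines the features using the
    # signs of the responses received in round 1.
    def final_query(x):
        features, label = x
        return label * sign(sum(w * f for w, f in zip(weights, features)))

    # The label is independent of the features, so the true mean of
    # final_query is 0 and its empirical mean equals the generalization gap.
    return sum(final_query(x) for x in sample) / len(sample)


data_rng = random.Random(0)
n, d = 100, 301  # d is odd so the weighted feature sum is never zero
sample = [
    ([data_rng.choice((-1, 1)) for _ in range(d)], data_rng.choice((-1, 1)))
    for _ in range(n)
]

gap_exact = adaptive_gap(sample, d, sigma=0.0, rng=random.Random(1))
gap_noisy = adaptive_gap(sample, d, sigma=1.0, rng=random.Random(1))
print("gap with exact answers:", gap_exact)
print("gap with noisy answers:", gap_noisy)
```

In this sketch the noise scale is deliberately large, on the order of the query range, which is the conservative choice that the worst-case analysis described above would suggest. It also makes the individual responses too noisy to be useful at this sample size; the variance-dependent guarantees discussed above concern when smaller noise can suffice.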