We give a new proof of the "transfer theorem" underlying adaptive data analysis: that any mechanism for answering adaptively chosen statistical queries that is differentially private and sample-accurate is also accurate out-of-sample. Our new proof is elementary and gives structural insights that we expect will be useful elsewhere. We show: 1) that differential privacy ensures that the expectation of any query on the posterior distribution on datasets induced by the transcript of the interaction is close to its true value on the data distribution, and 2) that sample accuracy on its own ensures that any query answer produced by the mechanism is close to its posterior expectation with high probability. This second claim follows from a thought experiment in which we imagine that the dataset is resampled from the posterior distribution after the mechanism has committed to its answers. The transfer theorem then follows by summing these two bounds and, in particular, avoids the "monitor argument" used to derive high probability bounds in prior work. An upshot of our new proof technique is that the concrete bounds we obtain are substantially better than the best previously known bounds, even though the improvements are in the constants rather than the asymptotics (which are known to be tight). As we show, our new bounds outperform the naive "sample-splitting" baseline at dramatically smaller dataset sizes than the previous state of the art, bringing techniques from this literature closer to practicality.
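The two-step structure described above can be sketched as a triangle inequality. The notation is an assumption introduced for illustration and does not appear in the abstract: $\mathcal{P}$ is the data distribution, $\pi$ the transcript of the interaction, $\mathcal{Q}_\pi$ the posterior distribution on datasets given $\pi$, $q_j$ the $j$-th query, and $a_j$ the mechanism's answer to it.

```latex
% Hypothetical notation: q_j(Q_pi) is the posterior expectation of query q_j.
\[
q_j(\mathcal{Q}_\pi) := \mathbb{E}_{S' \sim \mathcal{Q}_\pi}\bigl[ q_j(S') \bigr],
\]
\[
\bigl| a_j - q_j(\mathcal{P}) \bigr|
\;\le\;
\underbrace{\bigl| a_j - q_j(\mathcal{Q}_\pi) \bigr|}_{\text{claim 2: sample accuracy}}
\;+\;
\underbrace{\bigl| q_j(\mathcal{Q}_\pi) - q_j(\mathcal{P}) \bigr|}_{\text{claim 1: differential privacy}} .
\]
```

Bounding each term on the right separately and adding the two bounds gives an out-of-sample accuracy guarantee of the kind the transfer theorem states. The precise constants and failure probabilities are not given in the abstract and are left unspecified here.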