Ensuring that analyses performed on a dataset are representative of the entire population is one of the central problems in statistics. Most classical techniques assume that the dataset is independent of the analyst's query, and they break down in the common setting where a dataset is reused for multiple adaptively chosen queries. This problem of \emph{adaptive data analysis} was formalized in the seminal works of Dwork et al. (STOC, 2015) and Hardt and Ullman (FOCS, 2014). We identify a remarkably simple set of assumptions under which the queries will continue to be representative even when chosen adaptively: the only requirements are that each query takes as input a random subsample and outputs few bits. This result shows that the noise inherent in subsampling is sufficient to guarantee that query responses generalize. The simplicity of this subsampling-based framework allows it to model a variety of real-world scenarios not covered by prior work. We also demonstrate the utility of this framework by designing mechanisms for two foundational tasks: statistical queries and median finding. In particular, our mechanism for answering the broadly applicable class of statistical queries is both extremely simple and state of the art in many parameter regimes.
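To make the two requirements concrete (each query sees only a random subsample and outputs few bits), the following is a minimal illustrative sketch in Python, not the mechanism analyzed in this work. It assumes a query that maps a single data point to a value in [0, 1]. The function names (`subsampled_bit`, `answer_statistical_query`) and the parameters `w` (subsample size) and `k` (number of subsamples) are hypothetical and chosen only for illustration.

```python
import random
from typing import Callable, Sequence


def subsampled_bit(
    data: Sequence[float],
    query: Callable[[float], float],
    w: int,
    rng: random.Random,
) -> int:
    """Evaluate the query on a random subsample and release a single bit.

    Assumption: query(x) lies in [0, 1] for every data point x.
    """
    # Draw w points uniformly at random, without replacement.
    subsample = rng.sample(list(data), w)
    # Empirical mean of the query on the subsample, a value in [0, 1].
    p = sum(query(x) for x in subsample) / w
    # Output one bit that equals 1 with probability p.
    return 1 if rng.random() < p else 0


def answer_statistical_query(
    data: Sequence[float],
    query: Callable[[float], float],
    w: int,
    k: int,
    rng: random.Random,
) -> float:
    """Average k one-bit responses, each computed on a fresh random subsample."""
    votes = [subsampled_bit(data, query, w, rng) for _ in range(k)]
    return sum(votes) / k


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy dataset: half zeros, half ones, so the true mean of the identity query is 0.5.
    data = [0.0] * 500 + [1.0] * 500
    estimate = answer_statistical_query(data, lambda x: x, w=10, k=2000, rng=rng)
    print(f"estimated mean: {estimate:.3f}")
```

Each call to `subsampled_bit` touches only `w` data points and reveals a single bit, so the only randomness protecting the response is that of the subsampling and the one-bit rounding. The choice of `w` and `k` needed for a given accuracy is not specified here and would have to come from the analysis in the paper.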