Ensuring that analyses performed on a dataset are representative of the entire population is one of the central problems in statistics. Most classical techniques assume that the dataset is independent of the analyst's query and break down in the common setting where a dataset is reused for multiple adaptively chosen queries. This problem of \emph{adaptive data analysis} was formalized in the seminal works of Dwork et al. (STOC, 2015) and Hardt and Ullman (FOCS, 2014). We identify a remarkably simple set of assumptions under which the queries will continue to be representative even when chosen adaptively: the only requirements are that each query takes as input a random subsample and outputs few bits. This result shows that the noise inherent in subsampling is sufficient to guarantee that query responses generalize. The simplicity of this subsampling-based framework allows it to model a variety of real-world scenarios not covered by prior work. Beyond its simplicity, the framework is useful: we demonstrate this by designing mechanisms for two foundational tasks, statistical queries and median finding. In particular, our mechanism for answering the broadly applicable class of statistical queries is both extremely simple and state of the art in many parameter regimes.
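To make the two requirements concrete (each query sees only a random subsample and outputs few bits), the following is a minimal illustrative sketch in Python. It is not the mechanism from the paper: the function name `subsampled_statistical_query`, its parameters, and the choice to round each query value to a single random bit are all assumptions made here for illustration.

```python
import random
from typing import Callable, List


def subsampled_statistical_query(
    data: List[float],
    query: Callable[[List[float]], float],
    subsample_size: int,
    num_votes: int,
    rng: random.Random,
) -> float:
    """Estimate the average of `query` over random subsamples of `data`.

    Hypothetical sketch: each round draws a fresh subsample, evaluates the
    query on it (assumed to return a value in [0, 1]), and keeps only one
    random bit of that value. The answer is the fraction of 1-bits.
    """
    ones = 0
    for _ in range(num_votes):
        # The query only ever sees a small random subsample (without replacement).
        subsample = rng.sample(data, subsample_size)
        value = query(subsample)
        # Output a single bit: 1 with probability `value`, else 0.
        if rng.random() < value:
            ones += 1
    return ones / num_votes


if __name__ == "__main__":
    rng = random.Random(0)
    dataset = [0.0] * 50 + [1.0] * 50
    mean_query = lambda s: sum(s) / len(s)
    estimate = subsampled_statistical_query(dataset, mean_query, 5, 2000, rng)
    print(f"estimated mean: {estimate:.3f}")
```

The values `subsample_size=5` and `num_votes=2000` are arbitrary choices for the demonstration; the abstract does not state how these parameters should be set.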