Modern data workflows are inherently adaptive, repeatedly querying the same dataset to refine and validate sequential decisions, but such adaptivity can lead to overfitting and invalid statistical inference. Adaptive Data Analysis (ADA) mechanisms address this challenge; however, they face a fundamental tension between computational efficiency and sample complexity. For $T$ rounds of adaptive analysis, computationally efficient algorithms typically incur suboptimal $O(\sqrt{T})$ sample complexity, whereas statistically optimal $O(\log T)$ algorithms are computationally intractable under standard cryptographic assumptions. In this work, we shed light on this trade-off by identifying a natural class of data distributions under which both computational efficiency and optimal sample complexity are achievable. We propose a computationally efficient ADA mechanism that attains optimal $O(\log T)$ sample complexity when the data distribution is dense with respect to a known prior. This setting includes, in particular, feature--label data distributions arising in distribution-specific learning. As a consequence, our mechanism also yields a sample-efficient (i.e., $O(\log T)$-sample) statistical query oracle in the distribution-specific setting. Moreover, although our algorithm is not based on differential privacy, it satisfies a relaxed privacy notion known as Predicate Singling Out (PSO) security (Cohen and Nissim, 2020). Our results thus reveal an inherent connection between adaptive data analysis and privacy notions beyond differential privacy.