Scientific discovery relies on large-scale hypothesis testing. However, identifying true discoveries while controlling false discoveries faces two major challenges. First, obtaining relevant reference data (samples from the null distribution) is resource-intensive, so each p-value carries finite-data uncertainty. Second, the procedure should account for the inherent structure of the hypothesis space, when such structure exists. Here, we present a framework for controlling the false discovery rate (FDR) both when each hypothesis is supported by only a finite number of null draws, leaving its p-value uncertain, and when the hypothesis space carries arbitrary structure, requiring only that the structure be represented by a suitable reproducing kernel. We present two decision rules that are both robust to structural mis-specification but offer different trade-offs between exact FDR control and statistical power. The first rule guarantees exact FDR control; the second maximizes power by adapting mirror-statistic control to count space, using an analytical framework to assess FDR control when exact mirror symmetry is relaxed. Furthermore, the tractability afforded by the reproducing kernel Hilbert space (RKHS) framework allows us to investigate finite-data uncertainty directly, which we leverage to propose a policy for the efficient allocation of null-distribution samples.
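To make the finite-data issue concrete, the following minimal Python sketch shows how a p-value estimated from a finite number of null draws is limited in resolution, and how such p-values feed a baseline FDR procedure. It is not the framework described above: it assumes the standard add-one Monte Carlo p-value and the classical Benjamini-Hochberg step-up rule, and the function names `empirical_p_value` and `benjamini_hochberg` are hypothetical, chosen for illustration only.

```python
def empirical_p_value(observed, null_draws):
    """Add-one Monte Carlo p-value from a finite list of null draws.

    With n draws the smallest attainable value is 1 / (n + 1), so few
    draws leave the p-value coarse and uncertain.
    """
    exceed = sum(1 for x in null_draws if x >= observed)
    return (1 + exceed) / (1 + len(null_draws))


def benjamini_hochberg(p_values, alpha=0.1):
    """Classical BH step-up rule (baseline only, ignores any structure).

    Returns a list of booleans, True where the hypothesis is rejected.
    """
    m = len(p_values)
    # Indices of hypotheses sorted by increasing p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    k = 0  # largest rank whose p-value clears its threshold
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= alpha * rank / m:
            k = rank
    reject = [False] * m
    for i in order[:k]:
        reject[i] = True
    return reject


# Illustrative inputs (made up): one observed statistic, four null draws.
p = empirical_p_value(5.0, [1.0, 2.0, 3.0, 4.0])
decisions = benjamini_hochberg([0.01, 0.02, 0.5, 0.9], alpha=0.1)
```

In this toy case the observed statistic exceeds every null draw, yet the p-value cannot fall below 1/5 because only four draws are available. This is the kind of resolution limit that motivates allocating null samples carefully.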