Beyond Fixed False Discovery Rates: Post-Hoc Conformal Selection with E-Variables

Conformal selection (CS) uses calibration data to identify test inputs whose unobserved outcomes are likely to satisfy a pre-specified minimal quality requirement, while controlling the false discovery rate (FDR). Existing methods fix the target FDR level before observing the data, which prevents the user from adapting the balance between the number of selected test inputs and the FDR to downstream needs and constraints in light of the available data. For example, in genomics or neuroimaging, researchers often inspect the distribution of test statistics and decide how aggressively to pursue candidates based on the observed strength of evidence and the available follow-up resources. To address this limitation, we introduce post-hoc CS (PH-CS), which generates a path of candidate selection sets, each paired with a data-driven false discovery proportion (FDP) estimate. PH-CS lets the user select any operating point on this path by maximizing a user-specified utility, arbitrarily balancing selection size and FDR. Building on conformal e-variables and the e-Benjamini-Hochberg (e-BH) procedure, PH-CS provably provides a finite-sample post-hoc reliability guarantee: the ratio between the true FDP and the estimated FDP level is, on average, upper bounded by $1$, so that the average estimated FDP is, to first order, a valid upper bound on the true FDR. PH-CS is further extended to control quality defined in terms of a general risk. Experiments on synthetic and real-world datasets demonstrate that, unlike CS, PH-CS can consistently satisfy user-imposed utility constraints while producing reliable FDP estimates and maintaining competitive FDR control.
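One way to picture the path of operating points is through the e-BH rule itself. The sketch below is a minimal illustration under stated assumptions, not the paper's implementation: it takes a vector of per-input e-values as given (how conformal e-variables are built from calibration data is not covered here), and for each k it reports the smallest level at which the k inputs with the largest e-values form a self-consistent e-BH selection, namely m / (k * e_(k)), where e_(k) is the k-th largest e-value. The function names `ebh_path` and `pick_operating_point` and the example utility are hypothetical.

```python
import numpy as np


def ebh_path(e_values):
    """Return (order, fdp_hat) for the nested top-k selection sets.

    order[:k] holds the indices of the k largest e-values, and
    fdp_hat[k-1] = m / (k * e_(k)) is the smallest level at which that
    set satisfies the e-BH condition e_i >= m / (level * k) for all
    selected i. This is an assumed reading of "path", for illustration.
    """
    e = np.asarray(e_values, dtype=float)
    m = e.size
    order = np.argsort(-e, kind="stable")  # indices by decreasing e-value
    ks = np.arange(1, m + 1)
    with np.errstate(divide="ignore"):
        # An e-value of 0 gives an infinite (uninformative) level.
        fdp_hat = m / (ks * e[order])
    return order, fdp_hat


def pick_operating_point(order, fdp_hat, utility):
    """Pick the top-k set that maximizes a user-supplied utility(k, level)."""
    ks = np.arange(1, fdp_hat.size + 1)
    scores = np.array([utility(int(k), float(a)) for k, a in zip(ks, fdp_hat)])
    best = int(np.argmax(scores))
    return order[: best + 1], float(fdp_hat[best])


if __name__ == "__main__":
    # Made-up e-values for four test inputs (not data from the paper).
    order, fdp_hat = ebh_path([40.0, 1.0, 20.0, 0.5])
    # Hypothetical utility: selection size discounted by the FDP estimate.
    selected, level = pick_operating_point(
        order, fdp_hat, lambda k, a: k * (1.0 - a)
    )
    print(selected, level)
```

Levels above 1 carry no information and are left uncapped here so that the reported value stays exactly m / (k * e_(k)); a utility would normally never prefer such points.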

