Beyond Fixed False Discovery Rates: Post-Hoc Conformal Selection with E-Variables

Conformal selection (CS) uses calibration data to identify test inputs whose unobserved outcomes are likely to satisfy a pre-specified minimal quality requirement, while controlling the false discovery rate (FDR). Existing methods fix the target FDR level before observing data, which prevents the user from adapting the balance between the number of selected test inputs and the FDR to downstream needs and constraints based on the available data. For example, in genomics or neuroimaging, researchers often inspect the distribution of test statistics and decide how aggressively to pursue candidates based on the observed evidence strength and the available follow-up resources. To address this limitation, we introduce post-hoc CS (PH-CS), which generates a path of candidate selection sets, each paired with a data-driven false discovery proportion (FDP) estimate. PH-CS lets the user select any operating point on this path by maximizing a user-specified utility, arbitrarily balancing selection size and FDR. Building on conformal e-variables and the e-Benjamini-Hochberg (e-BH) procedure, PH-CS is proved to provide a finite-sample post-hoc reliability guarantee whereby the ratio of the true FDP to the estimated FDP level is, on average, upper bounded by $1$, so that the average estimated FDP is, to first order, a valid upper bound on the true FDR. PH-CS is also extended to control quality defined in terms of a general risk. Experiments on synthetic and real-world datasets demonstrate that, unlike CS, PH-CS can consistently satisfy user-imposed utility constraints while producing reliable FDP estimates and maintaining competitive FDR control.
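
The e-BH building block and the idea of a selection path can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the e-values are already given (the conformal construction from calibration data is not shown), and the function names `ebh_select`, `selection_path`, and `choose_operating_point`, as well as the per-size level `m / (k * e_(k))`, are illustrative assumptions. That level is simply the smallest FDR level at which standard e-BH would select the top `k` inputs.

```python
from typing import Callable, List, Tuple


def ebh_select(e_values: List[float], alpha: float) -> List[int]:
    """Standard e-BH at a fixed level alpha: select the k* largest e-values,
    where k* is the largest k such that k * e_(k) >= m / alpha."""
    m = len(e_values)
    order = sorted(range(m), key=lambda i: e_values[i], reverse=True)
    k_star = 0
    for k, idx in enumerate(order, start=1):
        if k * e_values[idx] >= m / alpha:
            k_star = k
    return sorted(order[:k_star])


def selection_path(e_values: List[float]) -> List[Tuple[List[int], float]]:
    """Illustrative path (an assumption, not the paper's exact estimator):
    for each size k, pair the top-k set with m / (k * e_(k)), capped at 1."""
    m = len(e_values)
    order = sorted(range(m), key=lambda i: e_values[i], reverse=True)
    path = []
    for k, idx in enumerate(order, start=1):
        e_k = e_values[idx]
        level = 1.0 if e_k <= 0 else min(1.0, m / (k * e_k))
        path.append((sorted(order[:k]), level))
    return path


def choose_operating_point(
    path: List[Tuple[List[int], float]],
    utility: Callable[[int, float], float],
) -> Tuple[List[int], float]:
    """Pick the path entry that maximizes a user-supplied utility(size, level)."""
    return max(path, key=lambda point: utility(len(point[0]), point[1]))


if __name__ == "__main__":
    e_vals = [20.0, 1.0, 8.0, 0.5]  # hypothetical e-values for four test inputs
    print(ebh_select(e_vals, alpha=0.25))
    for selected, level in selection_path(e_vals):
        print(selected, level)
    # Hypothetical utility: reward selection size, penalize the estimated level.
    print(choose_operating_point(selection_path(e_vals), lambda n, a: n - 4.0 * a))
```

In this sketch the user never commits to a level in advance: the path is computed once from the data, and the trade-off between selection size and estimated level is made afterwards through the utility.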

