Conformal prediction builds marginally valid prediction intervals that cover the unknown outcome of a randomly drawn new test point with a prescribed probability. In practice, however, practitioners often decide which test unit(s) to focus on in a data-driven manner after seeing the data, and then seek uncertainty quantification for the focal unit(s). In such cases, marginally valid conformal prediction intervals may fail to provide valid coverage for the focal unit(s) because of selection bias. This paper presents a general framework for constructing a prediction set with finite-sample exact coverage conditional on the unit being selected by a given procedure. The general form of our method works for arbitrary selection rules that are invariant to permutations of the calibration units, and it generalizes Mondrian Conformal Prediction to multiple test units and non-equivariant classifiers. We then work out computationally efficient implementations of our framework for a number of realistic selection rules, including top-K selection, optimization-based selection, selection based on conformal p-values, and selection based on properties of preliminary conformal prediction sets. The performance of our methods is demonstrated via applications in drug discovery and health risk prediction.
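To make the selection issue concrete, the following is a minimal simulation sketch, not the method proposed in the paper. It applies ordinary split conformal prediction with absolute-residual scores and compares coverage over all test units with coverage over the top-K test units ranked by predicted value. Everything specific here is an assumption chosen for illustration: the heteroscedastic data-generating process, the fixed predictor `mu(x) = x`, and the names `conformal_quantile` and `simulate`.

```python
import numpy as np


def conformal_quantile(scores, alpha):
    """Split-conformal threshold: the ceil((n+1)(1-alpha))-th smallest score."""
    n = len(scores)
    k = int(np.ceil((n + 1) * (1 - alpha)))
    if k > n:
        # Too few calibration points for this level: the interval is unbounded.
        return np.inf
    return np.sort(scores)[k - 1]


def simulate(n_cal=1000, n_test=1000, top_k=50, alpha=0.1, n_trials=200, seed=0):
    """Average coverage over all test units and over the top-K selected units.

    Assumed toy model (hypothetical, for illustration only):
    X ~ Uniform(0, 1), Y = X + (0.1 + X) * eps, eps ~ N(0, 1),
    with the predictor fixed at mu(x) = x.
    """
    rng = np.random.default_rng(seed)
    marginal, selected = [], []
    for _ in range(n_trials):
        x_cal = rng.uniform(0.0, 1.0, n_cal)
        y_cal = x_cal + (0.1 + x_cal) * rng.standard_normal(n_cal)
        x_test = rng.uniform(0.0, 1.0, n_test)
        y_test = x_test + (0.1 + x_test) * rng.standard_normal(n_test)

        # Calibrate one interval half-width from absolute residuals.
        q = conformal_quantile(np.abs(y_cal - x_cal), alpha)
        covered = np.abs(y_test - x_test) <= q

        # Data-driven focus: keep the K test units with the largest predictions.
        top = np.argsort(x_test)[-top_k:]

        marginal.append(covered.mean())
        selected.append(covered[top].mean())
    return float(np.mean(marginal)), float(np.mean(selected))


marginal_cov, selected_cov = simulate()
print(f"coverage over all test units: {marginal_cov:.3f}")
print(f"coverage over top-K selected: {selected_cov:.3f}")
```

In this toy setup the noise level grows with the prediction, so the selected units are systematically harder than an average test unit. Coverage over all test units stays near the nominal level, while coverage among the selected units falls well below it, which is the gap that selection-conditional coverage is meant to close.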