Given a family of pretrained models and a hold-out set, how can we construct a valid conformal prediction set while selecting a model that minimizes the width of the set? If we use the same hold-out set both to select a model (the one that yields the smallest conformal prediction sets) and to construct a conformal prediction set based on that selected model, we suffer a loss of coverage due to selection bias. Alternatively, we could further split the data to perform selection and calibration separately, but this comes at a steep cost when the dataset is small. In this paper, we address the challenge of constructing a valid prediction set after data-dependent model selection, most commonly selecting the model that minimizes the width of the resulting prediction sets. Our novel methods can be implemented efficiently and admit finite-sample validity guarantees without additional sample splitting. We show that our methods yield prediction sets with asymptotically optimal width under a certain notion of regularity for the model class. The improvement in the width of the prediction sets constructed by our methods is further demonstrated through applications to synthetic datasets in various settings and a real data example.
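To make the selection-bias issue concrete, the following is a minimal sketch of the naive baseline described above, in which one hold-out set is used both to pick the model and to calibrate it. It is not the method proposed in this paper. The function names, the absolute-residual score, and the toy simulation (Gaussian responses, with each model's predictions drawn as independent noise) are all assumptions made for illustration.

```python
import math
import numpy as np


def conformal_quantile(scores, alpha):
    """Split-conformal threshold: the ceil((n+1)(1-alpha))-th smallest score."""
    scores = np.sort(np.asarray(scores, dtype=float))
    n = scores.size
    k = math.ceil((n + 1) * (1 - alpha))
    # If k exceeds n, no finite threshold gives the guarantee.
    return float(scores[k - 1]) if k <= n else math.inf


def naive_select_and_calibrate(holdout_preds, holdout_y, alpha):
    """Pick the model with the narrowest interval, reusing the same hold-out set.

    holdout_preds has shape (n_models, n_holdout).
    Returns (index of the selected model, interval half-width).
    """
    residuals = np.abs(np.asarray(holdout_preds) - np.asarray(holdout_y))
    half_widths = [conformal_quantile(r, alpha) for r in residuals]
    best = int(np.argmin(half_widths))
    return best, half_widths[best]


def simulate_naive_coverage(n_models=50, n_holdout=50, n_test=1000,
                            alpha=0.1, n_trials=200, seed=0):
    """Toy estimate of the test coverage of the naive procedure (assumed setup)."""
    rng = np.random.default_rng(seed)
    covered = []
    for _ in range(n_trials):
        y_hold = rng.normal(size=n_holdout)
        y_test = rng.normal(size=n_test)
        # Hypothetical models: predictions are noise unrelated to the response,
        # so all models are equally good and selection only exploits chance.
        preds_hold = rng.normal(scale=0.5, size=(n_models, n_holdout))
        preds_test = rng.normal(scale=0.5, size=(n_models, n_test))
        best, q = naive_select_and_calibrate(preds_hold, y_hold, alpha)
        covered.append(np.mean(np.abs(preds_test[best] - y_test) <= q))
    return float(np.mean(covered))
```

Calling `simulate_naive_coverage()` returns an empirical coverage that can be compared with the nominal level `1 - alpha`. In this toy setup, one would expect it to fall below the nominal level, because the selected model is the one whose hold-out residuals happened to be smallest.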