Given data on a scalar random variable $Y$, a prediction set for $Y$ with miscoverage level $\alpha \in (0,1)$ is a set of values that contains a randomly drawn $Y$ with probability $1 - \alpha$. Among all prediction sets that satisfy this coverage property, the oracle prediction set is the one with the smallest volume. This paper provides methods for estimating such prediction sets, conditional on observed covariates, when $Y$ is censored or measured in intervals. First, we characterise the oracle prediction set under interval censoring and develop a consistent estimator for the shortest prediction interval that satisfies the coverage property. We then extend these consistency results to cases where the prediction set consists of multiple disjoint intervals. Second, we use conformal inference to construct a prediction set that achieves a particular notion of finite-sample validity under censoring and remains consistent as the sample size increases. This notion exploits exchangeability to obtain finite-sample guarantees on coverage using a specially constructed conformity score function. The procedure accommodates three sources of uncertainty: prediction uncertainty that is irreducible (due to the stochastic nature of outcomes), modelling uncertainty due to partial identification, and sampling uncertainty that shrinks as the sample grows. We conduct a set of Monte Carlo simulations and apply the methods to data from the Current Population Survey. The results highlight the robustness and efficiency of the proposed methods.
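To make the conformal step concrete, the following is a minimal sketch of split conformal prediction when only a bracket $[L, U]$ containing $Y$ is observed. It is not the conformity score constructed in the paper, which the abstract does not specify. It assumes a deliberately conservative, hypothetical score that measures how far the observed bracket sticks out of a preliminary interval. Covering the whole bracket implies covering $Y$, so exchangeability of the calibration and new observations gives coverage of at least $1 - \alpha$. The names `conformal_quantile`, `bracket_score`, `simulate` and `demo`, the data-generating process and the preliminary interval $[2x - 1, 2x + 1]$ are all illustrative assumptions.

```python
import numpy as np


def conformal_quantile(scores, alpha):
    """Order statistic of rank ceil((n + 1)(1 - alpha)) among n calibration scores."""
    scores = np.asarray(scores, dtype=float)
    n = scores.size
    k = int(np.ceil((n + 1) * (1 - alpha)))
    if k > n:
        # Too few calibration points for this alpha: only the whole line is valid.
        return np.inf
    return np.sort(scores)[k - 1]


def bracket_score(base_lo, base_hi, lo, hi):
    """Hypothetical score: how far the bracket [lo, hi] protrudes beyond
    the preliminary interval [base_lo, base_hi] (negative if strictly inside)."""
    return np.maximum(base_lo - lo, hi - base_hi)


def simulate(n, rng):
    """Assumed toy design: Y = 2X + noise, observed only as [floor(Y), floor(Y) + 1]."""
    x = rng.uniform(0.0, 5.0, size=n)
    y = 2.0 * x + rng.normal(size=n)
    lo = np.floor(y)
    return x, y, lo, lo + 1.0


def demo(seed=0, alpha=0.1):
    rng = np.random.default_rng(seed)
    x_cal, _, lo_cal, hi_cal = simulate(1000, rng)       # latent Y is never used here
    x_new, y_new, lo_new, hi_new = simulate(5000, rng)

    # Preliminary interval [2x - 1, 2x + 1] stands in for a fitted model (assumption).
    scores = bracket_score(2 * x_cal - 1, 2 * x_cal + 1, lo_cal, hi_cal)
    q = conformal_quantile(scores, alpha)

    # Widen the preliminary interval by the calibrated margin q.
    set_lo, set_hi = 2 * x_new - 1 - q, 2 * x_new + 1 + q
    cover_bracket = np.mean((set_lo <= lo_new) & (hi_new <= set_hi))
    cover_y = np.mean((set_lo <= y_new) & (y_new <= set_hi))
    return q, cover_bracket, cover_y


q, cover_bracket, cover_y = demo()
print(f"margin={q:.3f}  bracket coverage={cover_bracket:.3f}  Y coverage={cover_y:.3f}")
```

Because the score targets the bracket rather than the latent outcome, the resulting set over-covers $Y$. That slack is one way to see the cost of partial identification that a more carefully designed score would aim to reduce.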