Simultaneous Conformal Prediction of Missing Outcomes with Propensity Score $\epsilon$-Discretization

We study the problem of simultaneous predictive inference on multiple outcomes missing at random. We consider a suite of possible simultaneous coverage properties, conditional on the missingness pattern and on the (possibly discretized/binned) feature values. For data with discrete feature distributions, we develop a procedure that attains feature- and missingness-conditional coverage, and we further improve it by pooling its results after partitioning the unobserved outcomes. To handle general continuous feature distributions, we introduce methods based on discretized feature values. Because feature-discretized data may fail to remain missing at random, we propose propensity score $\epsilon$-discretization. This approach is inspired by the balancing property of the propensity score, namely that the missing data mechanism is independent of the outcome conditional on the propensity score [Rosenbaum and Rubin (1983)]. We show that the resulting pro-CP method achieves propensity score discretized feature- and missingness-conditional coverage when the propensity score is known exactly or is estimated sufficiently accurately. Furthermore, we consider a stronger inferential target, the squared-coverage guarantee, which penalizes the spread of the coverage proportion. We propose methods, termed pro-CP2, that achieve it with conditional properties similar to those shown for the usual coverage. A key novel technical contribution of our results is that propensity score discretization leads to a notion of approximate balancing, which we formalize and characterize precisely. In extensive empirical experiments on simulated data and on a job search intervention dataset, we illustrate that our procedures provide informative prediction sets with valid conditional coverage.
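
To make the idea of propensity score $\epsilon$-discretization concrete, the following is a minimal Python sketch of one possible binning rule. It is an illustration under stated assumptions, not the paper's exact construction: it assumes the propensity score p(x) = P(outcome observed | X = x) is known and lies strictly between 0 and 1, and it groups units on a grid of width log(1 + eps) in the log-odds of p, so that the odds of any two units in the same bin differ by a factor of at most 1 + eps. The function names `propensity_bin` and `group_by_bin` are hypothetical.

```python
import math


def propensity_bin(p, eps):
    """Map a propensity score p in (0, 1) to an integer bin index.

    Assumed rule (illustrative only): bins are cells of width log(1 + eps)
    on the log-odds scale, so odds within one bin vary by a factor < 1 + eps.
    """
    if not 0.0 < p < 1.0:
        raise ValueError("propensity score must lie strictly in (0, 1)")
    if eps <= 0.0:
        raise ValueError("eps must be positive")
    odds = p / (1.0 - p)
    return math.floor(math.log(odds) / math.log1p(eps))


def group_by_bin(scores, eps):
    """Group unit indices by the bin of their propensity score."""
    groups = {}
    for i, p in enumerate(scores):
        groups.setdefault(propensity_bin(p, eps), []).append(i)
    return groups


if __name__ == "__main__":
    # Hypothetical propensity scores for five units.
    scores = [0.1, 0.105, 0.5, 0.52, 0.9]
    for k, idx in sorted(group_by_bin(scores, eps=0.1).items()):
        print(k, idx)
```

Within each bin, the missingness mechanism is then approximately balanced in the sense that the observation odds are nearly constant. Under this sketch, a conformal procedure would be calibrated separately inside each bin; a smaller eps gives tighter balance at the cost of fewer units per bin.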
