Semi-supervised (SS) inference has received much attention in recent years. In addition to a moderate-sized labeled dataset, L, the SS setting is characterized by a much larger unlabeled dataset, U. The regime |U| >> |L| makes SS inference distinct from standard missing data problems, because the so-called "positivity" or "overlap" assumption is naturally violated. However, most of the SS literature implicitly assumes that L and U are identically distributed, i.e., that there is no selection bias in the labeling. Under missing at random (MAR) labeling, which allows for selection bias, the inferential challenges are inevitably exacerbated by the decaying nature of the propensity score (PS). We address this gap for a prototype problem: estimation of the response's mean. We propose a double robust SS (DRSS) mean estimator and give a complete characterization of its asymptotic properties. The proposed estimator is consistent as long as either the outcome model or the PS model is correctly specified. When both models are correctly specified, we provide inference results with a non-standard consistency rate that depends on the smaller sample size |L|. The results are also extended to causal inference with imbalanced treatment groups. Further, we provide several novel choices of models and estimators for the decaying PS, including a novel offset logistic model and a stratified labeling model, and we present their properties in both high- and low-dimensional settings. These may be of independent interest. Lastly, we present extensive simulations and a real data application.
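To make the double robustness idea concrete, the following is a minimal sketch of a generic augmented inverse probability weighting (AIPW) style mean estimator applied to simulated data with small labeling probabilities. It is not the DRSS estimator itself, whose exact construction the abstract does not give. The function name `dr_mean`, the simulated data-generating process, the least-squares outcome model, and the use of the true propensity score are all assumptions made purely for illustration.

```python
import numpy as np


def dr_mean(y, r, m_hat, pi_hat):
    """Generic AIPW-style doubly robust estimate of E[Y] (illustrative only).

    y      : responses; entries with r == 0 are unobserved and are ignored
    r      : labeling indicators (1 = labeled, 0 = unlabeled)
    m_hat  : fitted outcome-model predictions for every observation
    pi_hat : fitted propensity scores P(R = 1 | X) for every observation
    """
    y = np.asarray(y, dtype=float)
    r = np.asarray(r, dtype=float)
    m_hat = np.asarray(m_hat, dtype=float)
    pi_hat = np.asarray(pi_hat, dtype=float)
    # Unlabeled responses never enter the estimate; replace them with 0.
    y_obs = np.where(r == 1, y, 0.0)
    # Outcome-model prediction plus an inverse-probability-weighted residual.
    return float(np.mean(m_hat + r / pi_hat * (y_obs - m_hat)))


# Hypothetical simulation: labeling depends on X (selection bias under MAR),
# and labeling probabilities are small, so labeled data are scarce.
rng = np.random.default_rng(0)
n = 20000
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)        # true mean of Y is 1.0
pi = 1.0 / (1.0 + np.exp(-(-3.0 + x)))        # assumed true propensity score
r = rng.binomial(1, pi)

# Outcome model: ordinary least squares fitted on the labeled data only.
design = np.column_stack([np.ones(int(r.sum())), x[r == 1]])
beta, *_ = np.linalg.lstsq(design, y[r == 1], rcond=None)
m_hat = beta[0] + beta[1] * x

estimate = dr_mean(np.where(r == 1, y, np.nan), r, m_hat, pi)
naive = float(y[r == 1].mean())               # labeled-only mean, biased here
print(f"doubly robust: {estimate:.3f}, labeled-only mean: {naive:.3f}")
```

In this sketch the labeled-only mean is biased because labeling favors large values of X, while the doubly robust form corrects for the selection. A practical analysis would estimate the propensity score as well, and would typically use cross-fitting; neither is shown here.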