Demystifying Prediction Powered Inference

Machine learning predictions are increasingly used to supplement incomplete or costly-to-measure outcomes in fields such as biomedical research, environmental science, and social science. However, treating predictions as ground truth introduces bias, while ignoring them wastes valuable information. Prediction-Powered Inference (PPI) offers a principled framework that leverages predictions on large unlabeled datasets to improve statistical efficiency while maintaining valid inference through explicit bias correction using a smaller labeled subset. Despite its potential, the growing number of PPI variants and the subtle distinctions between them have made it challenging for practitioners to determine when and how to apply these methods responsibly. This paper demystifies PPI by synthesizing its theoretical foundations, methodological extensions, connections to the existing statistics literature, and diagnostic tools into a unified practical workflow. Using the Mosaiks housing price data, we show that PPI variants produce tighter confidence intervals than complete-case analysis, but that double-dipping, i.e., reusing training data for inference, leads to anti-conservative confidence intervals with coverage below the nominal level. Under missing-not-at-random mechanisms, all methods, including classical inference using only labeled data, yield biased estimates. We provide a decision flowchart linking assumption violations to appropriate PPI variants, a summary table of selected methods, and practical diagnostic strategies for evaluating core assumptions. By framing PPI as a general recipe rather than a single estimator, this work bridges methodological innovation and applied practice, helping researchers responsibly integrate predictions into valid inference.
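
To make the idea of "predictions plus explicit bias correction" concrete, the following is a minimal sketch of a PPI-style estimate of a population mean, written in the form in which the basic estimator is commonly presented. It is not taken from the paper. The function names (`ppi_mean_ci`, `classical_mean_ci`), the simulated data, and the deliberately biased predictor `f` are all assumptions made for illustration; the Mosaiks data are not used. The sketch also assumes that labeled and unlabeled observations come from the same distribution and that the labeled set was not used to train the predictor (no double-dipping).

```python
import numpy as np

def ppi_mean_ci(y_lab, f_lab, f_unlab, z=1.96):
    """PPI-style estimate of E[Y] with a normal-approximation interval.

    y_lab   : observed outcomes on the small labeled set
    f_lab   : model predictions on the same labeled set
    f_unlab : model predictions on the large unlabeled set
    """
    y_lab, f_lab, f_unlab = (np.asarray(a, dtype=float) for a in (y_lab, f_lab, f_unlab))
    n, N = len(y_lab), len(f_unlab)
    rectifier = y_lab - f_lab                      # estimates the prediction bias
    theta = f_unlab.mean() + rectifier.mean()      # prediction mean + bias correction
    se = np.sqrt(f_unlab.var(ddof=1) / N + rectifier.var(ddof=1) / n)
    return theta, (theta - z * se, theta + z * se)

def classical_mean_ci(y_lab, z=1.96):
    """Labeled-only (complete-case) estimate of E[Y] for comparison."""
    y_lab = np.asarray(y_lab, dtype=float)
    theta = y_lab.mean()
    se = np.sqrt(y_lab.var(ddof=1) / len(y_lab))
    return theta, (theta - z * se, theta + z * se)

# Simulated example (hypothetical data): the true mean of Y is 0.
rng = np.random.default_rng(0)
n, N = 200, 20_000

def f(x):
    # A fixed, pre-trained predictor that is systematically off by +0.3.
    return 2.0 * x + 0.3

x_lab = rng.normal(size=n)
y_lab = 2.0 * x_lab + rng.normal(scale=0.5, size=n)
x_unlab = rng.normal(size=N)                       # outcomes unobserved here

naive = f(x_unlab).mean()                          # treats predictions as truth
theta_ppi, ci_ppi = ppi_mean_ci(y_lab, f(x_lab), f(x_unlab))
theta_cc, ci_classical = classical_mean_ci(y_lab)

print("naive prediction mean:", naive)
print("PPI estimate and CI:  ", theta_ppi, ci_ppi)
print("labeled-only estimate:", theta_cc, ci_classical)
```

In this setup the naive prediction mean inherits the predictor's bias, the labeled-only interval is valid but wide, and the PPI interval is narrower because the rectifier `y_lab - f_lab` varies less than `y_lab` itself. That gain depends on the predictor being reasonably accurate and on the assumptions stated above.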
