Prediction-powered inference (PPI) is a rapidly growing framework for combining machine learning predictions with a small set of gold-standard labels to conduct valid statistical inference. In this article, I argue that the core estimators underlying PPI are equivalent to well-established estimators from the survey sampling literature dating back to the 1970s. Specifically, the PPI estimator for a population mean is algebraically equivalent to the difference estimator of Cassel et al. (1976), and PPI plus corresponds to the generalized regression (GREG) estimator of Särndal et al. (2003). Recognizing this equivalence, I consider what part of PPI is inherited from a long-standing literature in statistics, what part is genuinely new, and where inferential claims require care. After introducing the two frameworks and establishing their equivalence, I break down where PPI diverges from model-assisted estimation, including differences in the mode of inference, the role of the unlabeled data pool, and the consequences of differential prediction error for subgroup estimands such as the average treatment effect. I then identify what each framework offers the other: PPI researchers can draw on the survey sampling literature's well-developed theory of calibration, optimal allocation, and design-based diagnostics, while survey sampling researchers can benefit from PPI's extensions to non-standard estimands and its accessible software ecosystem. The article closes with a call for integration between these two communities, motivated by the growing use of large language models as measurement instruments in applied research.
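The claimed equivalence for the population mean can be illustrated numerically. The sketch below is not taken from the article: the simulated data, the function names `ppi_mean` and `difference_estimator`, and the equal-probability sampling weights are all assumptions made for illustration. It also treats the unlabeled pool as if it were the finite population over which the proxy values are averaged, which is one of the points where the two frameworks may differ.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: N unlabeled units (predictions only) and
# n labeled units (gold-standard label plus prediction).
N, n = 10_000, 200
y_pool = rng.normal(1.0, 1.0, N)                    # unobserved in practice
f_pool = y_pool + 0.3 + rng.normal(0.0, 0.5, N)     # systematically biased predictions
y_lab = rng.normal(1.0, 1.0, n)
f_lab = y_lab + 0.3 + rng.normal(0.0, 0.5, n)


def ppi_mean(f_pool, f_lab, y_lab):
    """PPI-style point estimate: mean prediction minus estimated prediction bias."""
    return f_pool.mean() - (f_lab - y_lab).mean()


def difference_estimator(proxy_pool, proxy_sample, y_sample):
    """Difference-style estimate: proxy mean plus weighted mean residual.

    Equal weights 1/n are assumed here (simple random sampling);
    unequal-probability designs would use different weights.
    """
    weights = np.full(len(y_sample), 1.0 / len(y_sample))
    return proxy_pool.mean() + np.sum(weights * (y_sample - proxy_sample))


ppi = ppi_mean(f_pool, f_lab, y_lab)
diff = difference_estimator(f_pool, f_lab, y_lab)
naive = f_pool.mean()  # ignores the labeled data entirely

print(f"naive prediction mean: {naive:.4f}")
print(f"PPI estimate:          {ppi:.4f}")
print(f"difference estimate:   {diff:.4f}")
```

Under these assumptions the two functions are the same expression with the terms reordered, so they agree up to floating-point error, while the naive mean of the predictions carries the prediction bias.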