As artificial intelligence and machine learning tools become more accessible, and scientists face new obstacles to data collection (e.g., rising costs, declining survey response rates), researchers increasingly use predictions from pre-trained algorithms as outcome variables. Though this practice is appealing for financial and logistical reasons, standard inference tools can misrepresent the association between independent variables and the outcome of interest when the true, unobserved outcome is replaced by a predicted value. In this paper, we characterize the statistical challenges inherent to this so-called ``post-prediction inference'' problem and elucidate three potential sources of error: (i) the relationship between predicted outcomes and their true, unobserved counterparts, (ii) the robustness of the machine learning model to resampling or uncertainty about the training data, and (iii) the appropriate propagation of not just bias but also uncertainty from predictions into the ultimate inference procedure. We also contrast the framework for post-prediction inference with classical work spanning several related fields, including survey sampling, missing data, and semi-supervised learning. This contrast clarifies the role of design in both classical and modern inference problems.