Recent work has focused on the common practice of prediction-based inference: that is, (i) using a pre-trained machine learning model to predict an unobserved response variable, and then (ii) conducting inference on the association between that predicted response and some covariates. As pointed out by Wang et al. [2020], applying a standard inferential approach in step (ii) does not accurately quantify the association between the unobserved (as opposed to the predicted) response and the covariates. Wang et al. [2020] and Angelopoulos et al. [2023] propose corrections to step (ii) to enable valid inference on the association between the unobserved response and the covariates. Here, we show that the method proposed by Angelopoulos et al. [2023] controls the type 1 error rate and provides confidence intervals that attain the nominal coverage, regardless of the quality of the pre-trained machine learning model used to predict the unobserved response. By contrast, the method proposed by Wang et al. [2020] provides valid inference only under very strong conditions that rarely hold in practice: for instance, if the machine learning model perfectly approximates the true regression function in the study population of interest.
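To make the problem in step (ii) concrete, the following is a minimal simulation sketch, not the implementation of either cited method. It assumes a linear model for the true response, a deliberately poor hypothetical pre-trained model `f`, and a small labeled sample in which the true response is observed. The "rectified" estimate subtracts the least-squares fit of the prediction error on the labeled sample from the naive fit on predictions; this is our understanding of the general form of a prediction-powered correction for linear regression, and it should be read as an assumption. All names (`simulate`, `f`, `ols`) and the data-generating values are hypothetical, and only point estimates are shown, not confidence intervals or type 1 error rates.

```python
import numpy as np


def ols(X, y):
    """Ordinary least-squares coefficients for design matrix X and response y."""
    return np.linalg.lstsq(X, y, rcond=None)[0]


def f(X):
    """Hypothetical, deliberately poor pre-trained model of the response."""
    x = X[:, 1]
    return 0.5 + 1.0 * x + 0.3 * np.sin(x)


def simulate(seed=0, n_labeled=2000, n_unlabeled=20000):
    """Compare naive and rectified slope estimates (illustrative assumptions only)."""
    rng = np.random.default_rng(seed)

    def draw(n):
        # Assumed truth: intercept 1, slope 2, unit-variance Gaussian noise.
        x = rng.normal(size=n)
        y = 1.0 + 2.0 * x + rng.normal(size=n)
        X = np.column_stack([np.ones(n), x])
        return X, y

    X_l, y_l = draw(n_labeled)    # sample where the true response is observed
    X_u, _ = draw(n_unlabeled)    # sample where only predictions are available

    # Step (ii) done naively: regress the *predicted* response on the covariates.
    naive = ols(X_u, f(X_u))

    # Rectifier: how far the prediction-based fit is off, estimated from labeled data.
    rectifier = ols(X_l, f(X_l) - y_l)

    # Rectified estimate targets the association with the true response.
    corrected = naive - rectifier
    return naive, corrected


if __name__ == "__main__":
    naive, corrected = simulate()
    print("naive slope:    ", naive[1])
    print("corrected slope:", corrected[1])
```

Under these assumptions, the naive slope reflects the association between the covariate and the predicted response and sits well below the assumed true slope of 2, while the rectified slope should land close to 2 even though `f` is a poor approximation of the regression function.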