As artificial intelligence and machine learning tools become more accessible, and as scientists face new obstacles to data collection (e.g., rising costs and declining survey response rates), researchers increasingly use predictions from pre-trained algorithms as outcome variables. Though this practice is appealing for financial and logistical reasons, standard tools for inference can misrepresent the association between independent variables and the outcome of interest when the true, unobserved outcome is replaced by a predicted value. In this paper, we characterize the statistical challenges inherent to this so-called ``inference with predicted data'' problem and elucidate three potential sources of error: (i) the relationship between predicted outcomes and their true, unobserved counterparts, (ii) the robustness of the machine learning model to resampling or uncertainty about the training data, and (iii) the appropriate propagation of not just bias but also uncertainty from predictions into the ultimate inference procedure.
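To make the first source of error concrete, the following is a minimal simulation sketch, not the procedure studied in this paper. It assumes a hypothetical setting in which the pre-trained model never sees the covariate `x` directly and instead predicts the outcome from a noisy proxy `w`. All names (`simulate`, `ols_slope`, `w`, `beta`) and all parameter values are illustrative assumptions.

```python
import numpy as np


def ols_slope(x, y):
    """Slope from a simple least-squares regression of y on x (with intercept)."""
    xc = x - x.mean()
    return float(xc @ (y - y.mean()) / (xc @ xc))


rng = np.random.default_rng(0)
beta = 1.0  # assumed true association between x and the outcome y


def simulate(n):
    """Draw (x, w, y): covariate, noisy proxy of the covariate, true outcome."""
    x = rng.normal(size=n)
    w = x + rng.normal(size=n)                     # proxy the prediction model sees
    y = beta * x + rng.normal(scale=0.5, size=n)   # true outcome
    return x, w, y


# "Pre-train" a prediction model on labeled data: regress y on the proxy w.
_, w_train, y_train = simulate(50_000)
coef = ols_slope(w_train, y_train)
intercept = y_train.mean() - coef * w_train.mean()

# New sample: y would be unobserved in practice; we keep it here only for comparison.
x, w, y = simulate(50_000)
y_hat = intercept + coef * w

slope_true = ols_slope(x, y)      # inference with the true outcome
slope_pred = ols_slope(x, y_hat)  # naive inference with the predicted outcome

print(f"slope using true outcomes:      {slope_true:.3f}")
print(f"slope using predicted outcomes: {slope_pred:.3f}")
```

Under these assumptions the slope estimated from true outcomes concentrates near `beta = 1`, while the slope estimated from predicted outcomes concentrates near 0.5, because the prediction model shrinks the signal carried by the noisy proxy. The naive analysis also ignores the uncertainty in `coef` and `intercept`, which relates to the second and third sources of error listed above.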