Prediction Aided by Surrogate Training

We study a class of prediction problems in which relatively few observations have associated responses, but all observations include both standard covariates and additional "helper" covariates. While the end goal is to make high-quality predictions using only the standard covariates, helper covariates can be exploited during training to improve prediction. Helper covariates arise in many applications, including forecasting in time series; incorporation of biased or mis-calibrated predictions from foundation models; and sharing information in transfer learning. We propose "prediction aided by surrogate training" ($\texttt{PAST}$), a class of methods that first exploit labeled data to construct a response estimator based on both the standard and helper covariates, and then use the full dataset with pseudo-responses to train a predictor based only on standard covariates. We establish guarantees on the prediction error of this procedure, with the response estimator allowed to be constructed in an arbitrary way, and the final predictor fit by empirical risk minimization over an arbitrary function class. These upper bounds involve the risk associated with the oracle data set (all responses available), plus an overhead that measures the accuracy of the pseudo-responses. This theory characterizes both the regimes in which $\texttt{PAST}$ accuracy is comparable to the oracle accuracy and the more challenging regimes in which it behaves poorly. We demonstrate its empirical performance across a range of applications, including forecasting of societal ills over time with future covariates as helpers; prediction of cardiovascular risk after heart attacks with prescription data as helpers; and diagnosing pneumonia from chest X-rays using machine-generated predictions as helpers.
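The two-stage procedure described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes, purely for illustration, that both the response estimator and the final predictor are ordinary least-squares fits, and that pseudo-responses are used for every observation, including the labeled ones. The names `fit_least_squares`, `predict_linear`, and `past_fit`, as well as the simulated data, are hypothetical.

```python
import numpy as np


def fit_least_squares(features, targets):
    """Least-squares fit with an intercept; returns the coefficient vector."""
    design = np.column_stack([np.ones(len(features)), features])
    coef, *_ = np.linalg.lstsq(design, targets, rcond=None)
    return coef


def predict_linear(coef, features):
    """Apply a coefficient vector produced by fit_least_squares."""
    design = np.column_stack([np.ones(len(features)), features])
    return design @ coef


def past_fit(x_all, w_all, y_labeled, labeled_idx):
    """Hypothetical PAST-style fit.

    x_all: standard covariates for all observations, shape (n, d).
    w_all: helper covariates for all observations, shape (n, k).
    y_labeled: responses for the labeled rows only.
    labeled_idx: row indices of the labeled observations.
    """
    xw_all = np.column_stack([x_all, w_all])
    # Stage 1: response estimator built from labeled data,
    # using both standard and helper covariates.
    estimator = fit_least_squares(xw_all[labeled_idx], y_labeled)
    # Pseudo-responses for the full dataset (an assumption of this sketch:
    # observed labels are not substituted back in).
    pseudo = predict_linear(estimator, xw_all)
    # Stage 2: empirical risk minimization (squared loss, linear class)
    # on the standard covariates only.
    return fit_least_squares(x_all, pseudo)


# Simulated example: few labels, an informative helper covariate.
rng = np.random.default_rng(0)
n, n_labeled = 2000, 50
x = rng.normal(size=(n, 3))
w = rng.normal(size=(n, 1))
y = x @ np.array([1.0, -2.0, 0.5]) + 2.0 * w[:, 0] + 0.1 * rng.normal(size=n)
labeled = np.arange(n_labeled)

coef_past = past_fit(x, w, y[labeled], labeled)       # uses helpers in training
coef_naive = fit_least_squares(x[labeled], y[labeled])  # labeled data only
coef_oracle = fit_least_squares(x, y)                  # all responses available
```

Comparing `coef_past` and `coef_naive` against `coef_oracle` gives a rough sense of how much the helper covariates contribute in this toy setting. The final predictor takes only `x` as input, so the helper covariates are not needed at prediction time.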
