In many risk prediction problems, covariates and a surrogate for the response are routinely available for a large target population, whereas the true response is costly to ascertain and is observed for only a limited subset. This creates a design problem: under a fixed measurement budget, one must decide which observations should have their response measured in order to build a prediction model. We propose a surrogate-assisted optimal sampling framework for risk prediction under measurement constraints. In the target setting, the surrogate identifies confirmed positive cases, while responses for surrogate-negative observations remain unobserved and can be selectively measured; the sampling design therefore determines how the response measurement budget is allocated. Our framework constructs an optimal sampling design that minimizes the leading term of the expected out-of-sample cross-entropy loss and incorporates the resulting design into an inverse-probability-weighted cross-entropy estimator. The proposed design depends only on covariates, the surrogate, and a preliminary estimator, and therefore does not require responses from unlabeled observations at the design stage. We establish consistency, asymptotic normality, and leading-order prediction optimality of the resulting estimator. Extensive simulation studies and two real-data applications demonstrate that the proposed design improves prediction performance and remains robust under surrogate misspecification and in rare-outcome settings.
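To make the estimation step concrete, the following is a minimal sketch of a generic inverse-probability-weighted cross-entropy fit for a logistic risk model. It is not the paper's implementation: the function names `ipw_cross_entropy` and `fit_ipw_logistic`, the logistic link, and the Newton solver are assumptions chosen for illustration, and the sampling probabilities `pi` are taken as given rather than derived from the optimal design, whose form is not stated in the abstract. Under the setting described, surrogate-positive cases would plausibly enter with `sampled = True`, `pi = 1`, and a known positive response.

```python
import numpy as np


def ipw_cross_entropy(beta, X, y, sampled, pi):
    """Inverse-probability-weighted cross-entropy for a logistic model.

    X: (n, d) covariates; y: (n,) responses, used only where sampled is True;
    sampled: (n,) boolean measurement indicator; pi: (n,) sampling probabilities.
    All names here are hypothetical, not taken from the paper.
    """
    z = X @ beta
    w = sampled.astype(float) / pi          # weight is 0 for unmeasured rows
    y_safe = np.where(sampled, y, 0.0)      # unmeasured y never contributes
    # Per-observation logistic loss: log(1 + exp(z)) - y * z
    loss = np.logaddexp(0.0, z) - y_safe * z
    return np.sum(w * loss) / len(w)


def fit_ipw_logistic(X, y, sampled, pi, n_iter=50, ridge=1e-8, tol=1e-10):
    """Minimize the weighted loss above by Newton's method (assumed solver)."""
    n, d = X.shape
    w = sampled.astype(float) / pi
    y_safe = np.where(sampled, y, 0.0)
    beta = np.zeros(d)
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))
        grad = X.T @ (w * (p - y_safe)) / n
        # Small ridge term keeps the Hessian invertible in degenerate cases
        hess = (X * (w * p * (1.0 - p))[:, None]).T @ X / n + ridge * np.eye(d)
        step = np.linalg.solve(hess, grad)
        beta = beta - step
        if np.max(np.abs(step)) < tol:
            break
    return beta
```

Weighting each measured observation by `1 / pi` is the standard device that keeps the weighted loss unbiased for the full-population cross-entropy when measurement probabilities vary across observations; the design question the abstract addresses is how to choose those probabilities.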