In two-phase multiwave sampling, inexpensive measurements are collected on a large sample and expensive, more informative measurements are adaptively obtained on subsets of units across multiple waves. Adaptively collecting the expensive measurements can increase efficiency but complicates statistical inference. We give valid estimators and confidence intervals for M-estimation under adaptive two-phase multiwave sampling. We focus on the case where proxies for the expensive variables -- such as predictions from pretrained machine learning models -- are available for all units and propose a Multiwave Predict-Then-Debias estimator that combines proxy information with the expensive, higher-quality measurements to improve efficiency while removing bias. We establish asymptotic linearity and normality and propose asymptotically valid confidence intervals. We also develop an approximately greedy sampling strategy that improves efficiency relative to uniform sampling. Data-based simulation studies support the theoretical results and demonstrate efficiency gains.
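As a rough illustration of the predict-then-debias idea, the sketch below treats only the simplest case: estimating a population mean rather than a general M-estimand. It averages the proxy over all units and adds an inverse-probability-weighted correction from the units whose expensive value was observed. The names `ptd_mean` and `simulate_two_waves`, the two-strata allocation rule, the clipping bounds, and the simulated data are all assumptions made for illustration; they are not the paper's estimator or its approximately greedy sampling strategy.

```python
import numpy as np

def ptd_mean(proxy, y, sampled, prob):
    """Proxy average plus an inverse-probability-weighted residual correction.

    proxy   : proxy values, available for every unit
    y       : expensive values (entries for unsampled units are ignored)
    sampled : boolean mask, True where the expensive value was collected
    prob    : sampling probability used for each unit when it was considered
    """
    proxy = np.asarray(proxy, dtype=float)
    sampled = np.asarray(sampled, dtype=bool)
    prob = np.asarray(prob, dtype=float)
    # Residuals are only used for sampled units; others contribute zero.
    resid = np.where(sampled, np.asarray(y, dtype=float) - proxy, 0.0)
    return proxy.mean() + np.mean(resid / prob)

def simulate_two_waves(n=2000, budget=0.2, seed=0):
    """Hypothetical two-wave design: uniform in wave 0, adaptive in wave 1."""
    rng = np.random.default_rng(seed)
    proxy = rng.normal(size=n)
    # Assumed data: the proxy is noisier when it is positive.
    noise_sd = np.where(proxy > 0, 1.0, 0.2)
    y = proxy + rng.normal(size=n) * noise_sd
    wave = rng.integers(0, 2, size=n)       # each unit is considered in one wave
    stratum = (proxy > 0).astype(int)       # strata built from the proxy only
    prob = np.full(n, budget)
    sampled = np.zeros(n, dtype=bool)

    # Wave 0: uniform Bernoulli sampling.
    w0 = wave == 0
    sampled[w0] = rng.random(w0.sum()) < prob[w0]

    # Wave 1: probabilities proportional to the residual SD estimated per
    # stratum from wave-0 labels, so they depend only on past data and proxies.
    sd = np.array([np.std((y - proxy)[w0 & sampled & (stratum == s)])
                   for s in (0, 1)])
    w1 = wave == 1
    prob[w1] = np.clip(budget * sd[stratum[w1]] / sd.mean(), 0.05, 1.0)
    sampled[w1] = rng.random(w1.sum()) < prob[w1]

    return ptd_mean(proxy, y, sampled, prob), y.mean(), prob

if __name__ == "__main__":
    est, full_mean, _ = simulate_two_waves()
    print(f"debiased estimate: {est:.3f}, full-data mean: {full_mean:.3f}")
```

In this sketch each wave-1 probability is fixed before that unit's sampling indicator is drawn, so the correction term has mean zero given the past. That property is what lets the proxy bias be removed even though the design is adaptive.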