In many randomized trials, outcomes such as essays or open-ended responses must be manually scored before impact analysis can proceed, a process that is costly and limits scale. Model-assisted estimation offers a way to combine surrogate outcomes generated by machine learning or large language models with a human-coded subset, yet typical implementations use simple random sampling and therefore overlook systematic variation in surrogate prediction error. We extend this framework by incorporating stratified sampling to allocate human coding effort more efficiently. We derive the exact variance of the stratified model-assisted estimator, characterize conditions under which stratification improves precision, and identify a Neyman-type optimal allocation rule that oversamples strata with larger residual variance. A comprehensive simulation study assesses finite-sample performance and shows that stratification consistently improves efficiency when surrogate prediction errors exhibit structured bias or heteroskedasticity. We also present two empirical applications, one using data from an education RCT and one using a large observational corpus, to illustrate how these methods can be implemented in practice with ChatGPT-generated surrogate outcomes. Overall, this framework provides a practical design-based approach for leveraging surrogate outcomes and strategically allocating human coding effort to obtain unbiased estimates with greater efficiency. While motivated by text-as-data applications, the methodology applies broadly to any setting where outcome measurement is costly or infeasible, and it can be used for comparisons across groups or for estimating the mean of a single group.
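As a minimal illustrative sketch (not the paper's implementation; all function names and the pilot-based residual standard deviations are hypothetical), the two ingredients described above, a Neyman-type allocation that oversamples strata with larger residual variance and a stratified model-assisted (difference) estimator, might look like:

```python
import numpy as np

def neyman_allocation(stratum_sizes, residual_sds, n_budget):
    """Split the human-coding budget across strata in proportion to
    N_h * S_h, where S_h is the stratum's (e.g. pilot-estimated)
    standard deviation of the surrogate prediction error Y - Yhat."""
    w = np.asarray(stratum_sizes, dtype=float) * np.asarray(residual_sds, dtype=float)
    alloc = n_budget * w / w.sum()
    # Largest-remainder rounding so the integer allocations sum to the budget.
    base = np.floor(alloc).astype(int)
    leftover = n_budget - base.sum()
    base[np.argsort(alloc - base)[::-1][:leftover]] += 1
    return base

def stratified_model_assisted_mean(yhat_by_stratum, resid_by_stratum, stratum_sizes):
    """Difference estimator: the full-sample mean of the surrogate outcome
    plus a size-weighted, per-stratum mean of the human-minus-surrogate
    residuals observed on the coded subsample."""
    N = np.asarray(stratum_sizes, dtype=float)
    surrogate_mean = np.concatenate(yhat_by_stratum).mean()  # all units
    correction = sum(n * np.mean(r) for n, r in zip(N, resid_by_stratum)) / N.sum()
    return surrogate_mean + correction
```

For example, with two equally sized strata whose residual standard deviations are 1.0 and 3.0, the allocation assigns three times as much coding effort to the noisier stratum; the bias correction then weights each stratum's residual mean by its share of the sample.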