Surveys are commonly used to facilitate research in epidemiology, health, and the social and behavioral sciences. Often, these surveys are not simple random samples, and respondents are given weights reflecting their probability of selection into the survey. It is well known that analysts can use these survey weights to produce unbiased estimates of population quantities like totals. In this article, we show that survey weights can also be beneficial for evaluating the quality of predictive models when splitting data into training and test sets. In particular, we characterize model assessment statistics, such as sensitivity and specificity, as finite population quantities, and compute survey-weighted estimates of these quantities with sample test data comprising a random subset of the original data. Using simulations with data from the National Survey on Drug Use and Health and the National Comorbidity Survey, we show that unweighted metrics estimated with sample test data can misrepresent population performance, but weighted metrics appropriately adjust for the complex sampling design. We also show that this conclusion holds for models trained using upsampling for mitigating class imbalance. The results suggest that weighted metrics should be used when evaluating performance on sample test data.
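To make the idea of a survey-weighted metric concrete, the following is a minimal sketch, not the article's own implementation. It assumes binary labels coded 0/1 and treats each test-set weight as the number of population units that respondent represents. Sensitivity and specificity are estimated as ratios of weighted totals, which is one natural choice of estimator and is an assumption here. The function name `weighted_sensitivity_specificity` and the toy data are hypothetical.

```python
import numpy as np


def weighted_sensitivity_specificity(y_true, y_pred, weights):
    """Estimate sensitivity and specificity as ratios of weighted totals.

    Hypothetical helper: each weight is assumed to be the number of
    population units the corresponding test-set respondent represents.
    """
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    w = np.asarray(weights, dtype=float)

    # Weighted cells of the confusion matrix (estimated population totals).
    tp = w[y_true & y_pred].sum()
    fn = w[y_true & ~y_pred].sum()
    tn = w[~y_true & ~y_pred].sum()
    fp = w[~y_true & y_pred].sum()

    # Return NaN when a class is absent from the test data.
    sensitivity = tp / (tp + fn) if (tp + fn) > 0 else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) > 0 else float("nan")
    return sensitivity, specificity


# Toy test set (made-up values, for illustration only).
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 0, 1, 0, 1, 0]
weights = [10, 50, 10, 20, 5, 20]

# Equal weights reproduce the usual unweighted metrics.
unweighted = weighted_sensitivity_specificity(y_true, y_pred, np.ones(6))
weighted = weighted_sensitivity_specificity(y_true, y_pred, weights)

print("unweighted:", unweighted)  # sensitivity 2/3, specificity 2/3
print("weighted:  ", weighted)    # sensitivity 20/70, specificity 40/45
```

In this toy example the missed positive case carries a large weight, so the weighted sensitivity (about 0.29) falls well below the unweighted value (about 0.67). This illustrates how an unweighted metric computed on sample test data can differ from the corresponding weighted estimate.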