Generalization to new samples is a fundamental rationale for statistical modeling. For this purpose, model validation is particularly important, but recent work in survey inference has suggested that simply aggregating individual prediction scores does not give a good measure of the score for population aggregate estimates. In this manuscript we explain why this occurs, propose two scoring metrics designed specifically for this problem, and demonstrate their use in three different ways. We show that these scoring metrics order models correctly relative to the ordering given by the true score, although they underestimate the magnitude of the score. We demonstrate the metrics on a problem in survey research, where multilevel regression and poststratification (MRP) has been used extensively to adjust convenience and low-response surveys to make population and subpopulation estimates.