Extended sample size calculations for evaluation of prediction models using a threshold for classification

Rebecca Whittle, Joie Ensor, Lucinda Archer, Gary S. Collins, Paula Dhiman, Alastair Denniston, Joseph Alderman, Amardeep Legha, Maarten van Smeden, Karel G. Moons, Jean-Baptiste Cazier, Richard D. Riley, Kym I. E. Snell

From arXiv, 27 pages, 1 figure

When evaluating the performance of a model for individualised risk prediction, the sample size needs to be large enough to precisely estimate the performance measures of interest. Current sample size guidance is based on precisely estimating calibration, discrimination, and net benefit, and applying it should be the first stage of calculating the minimum required sample size. However, when a clinically important threshold is used for classification, other performance measures can also be used. We extend the previously published guidance to precisely estimate threshold-based performance measures. We have developed closed-form solutions to estimate the sample size required to target sufficiently precise estimates of accuracy, specificity, sensitivity, positive predictive value (PPV), negative predictive value (NPV), and F1-score in an external evaluation study of a prediction model with a binary outcome. This approach requires the user to pre-specify the target standard error and the expected value for each performance measure. We describe how the sample size formulae were derived and demonstrate their use in an example. Extension to time-to-event outcomes is also considered. In our examples, the minimum sample size required to precisely estimate the threshold-based performance measures was lower than that required to precisely estimate the calibration slope, and we expect this would most often be the case. Our formulae, along with corresponding Python code and updated R and Stata commands (pmvalsampsize), enable researchers to calculate the minimum sample size needed to precisely estimate threshold-based performance measures in an external evaluation study. These criteria should be used alongside previously published criteria to precisely estimate calibration, discrimination, and net benefit.
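To make the idea of targeting a standard error concrete, the following is a minimal sketch and not the authors' code or the pmvalsampsize implementation. It assumes each measure can be treated as a simple binomial proportion, so that SE = sqrt(p(1 - p) / (n × f)), where p is the expected value of the measure and f is the expected fraction of the total sample forming its denominator (for example, the outcome prevalence for sensitivity). The function names and arguments are hypothetical, the expected PPV, NPV, and accuracy are derived from the assumed sensitivity, specificity, and prevalence for internal consistency rather than specified separately, and the F1-score is omitted because it is not a simple proportion. The paper's exact formulae may differ.

```python
import math


def n_for_proportion(expected, target_se, denom_fraction=1.0):
    """Smallest total n giving SE <= target_se for a binomial proportion.

    denom_fraction is the expected share of the total sample that forms the
    measure's denominator (an assumption of this sketch, not the paper).
    """
    if not 0 < expected < 1:
        raise ValueError("expected must lie strictly between 0 and 1")
    if target_se <= 0 or not 0 < denom_fraction <= 1:
        raise ValueError("target_se must be > 0 and denom_fraction in (0, 1]")
    # Rearranged from SE^2 = p(1 - p) / (n * denom_fraction).
    return math.ceil(expected * (1 - expected) / (target_se**2 * denom_fraction))


def threshold_sample_sizes(sens, spec, prevalence, target_se):
    """Total sample size per measure under the binomial approximation."""
    # Expected proportion classified positive at the chosen threshold.
    p_pos = sens * prevalence + (1 - spec) * (1 - prevalence)
    ppv = sens * prevalence / p_pos
    npv = spec * (1 - prevalence) / (1 - p_pos)
    accuracy = sens * prevalence + spec * (1 - prevalence)
    return {
        "sensitivity": n_for_proportion(sens, target_se, prevalence),
        "specificity": n_for_proportion(spec, target_se, 1 - prevalence),
        "ppv": n_for_proportion(ppv, target_se, p_pos),
        "npv": n_for_proportion(npv, target_se, 1 - p_pos),
        "accuracy": n_for_proportion(accuracy, target_se),
    }


if __name__ == "__main__":
    # Illustrative inputs only; not taken from the paper's example.
    sizes = threshold_sample_sizes(sens=0.80, spec=0.75, prevalence=0.30, target_se=0.02)
    print(sizes)
    print("largest requirement:", max(sizes.values()))
```

In a sketch like this, the largest value across the measures of interest would be the binding threshold-based requirement, which would then be compared with the sample size needed for calibration, discrimination, and net benefit.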
