As novel biomarker-based algorithms and models accelerate disease risk prediction and stratification in medicine, it is crucial to evaluate these models within the context of their intended clinical application. Prediction models output an absolute risk of disease, and patient counseling and shared decision-making are then based on this estimated individual risk together with a cost-benefit assessment. The overall impact of such an application is often referred to as clinical utility, which has recently received considerable attention in model assessment. The classic Brier score is a popular measure of prediction accuracy; however, it is insufficient for assessing clinical utility. To address this limitation, we propose a class of weighted Brier scores that align with the decision-theoretic framework of clinical utility. Additionally, we decompose the weighted Brier score into discrimination and calibration components and examine how weighting affects both the overall score and its individual components. Through this decomposition, we link the weighted Brier score to the $H$ measure, which has been proposed as a coherent alternative to the area under the receiver operating characteristic curve. This theoretical link to the $H$ measure further supports our weighting method and underscores the essential roles of discrimination and calibration in evaluating risk predictions. We demonstrate the practical use of the weighted Brier score as an overall summary measure using data from the Prostate Cancer Active Surveillance Study (PASS).
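The link between the Brier score and clinical utility can be illustrated with a well-known identity: for a risk $p$ and outcome $y \in \{0, 1\}$, $(p - y)^2 = 2\int_0^1 L_c(p, y)\,dc$, where $L_c$ is the misclassification loss when a patient is labeled high risk iff $p > c$, with a false positive costing $c$ and a false negative costing $1 - c$. The classic Brier score therefore weights every decision threshold $c$ equally. The sketch below replaces the uniform weight with a user-supplied density $w(c)$. This is a minimal illustration under stated assumptions, not the authors' definition: the names `weighted_brier_score` and `beta_pdf`, the factor of 2 normalization, and the use of Beta densities as weights are all hypothetical choices here (the $H$ measure likewise averages over a Beta distribution of misclassification costs). With a Beta(1, 1), that is uniform, weight, the sketch reproduces the classic Brier score.

```python
import math
from typing import Callable, Sequence


def brier_score(probs: Sequence[float], outcomes: Sequence[int]) -> float:
    """Classic Brier score: mean squared difference between risk and outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


def beta_pdf(a: float, b: float) -> Callable[[float], float]:
    """Return the Beta(a, b) density (a hypothetical choice of threshold weight)."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)

    def pdf(c: float) -> float:
        # Valid for 0 < c < 1; the midpoint grid below never hits 0 or 1.
        return math.exp(log_norm + (a - 1) * math.log(c) + (b - 1) * math.log1p(-c))

    return pdf


def weighted_brier_score(
    probs: Sequence[float],
    outcomes: Sequence[int],
    weight: Callable[[float], float],
    n_grid: int = 10_000,
) -> float:
    """Average over subjects of 2 * integral_0^1 weight(c) * L_c(p, y) dc.

    L_c is the cost-weighted loss when a patient is classified as high risk
    iff p > c: a false positive costs c and a false negative costs 1 - c.
    The integral is approximated with the midpoint rule on n_grid cells.
    """
    h = 1.0 / n_grid
    grid = [(k + 0.5) * h for k in range(n_grid)]
    w = [weight(c) for c in grid]
    total = 0.0
    for p, y in zip(probs, outcomes):
        if y == 0:
            # Non-case: false positive at every threshold c below the risk p.
            total += sum(wc * c for wc, c in zip(w, grid) if c < p) * h
        else:
            # Case: false negative at every threshold c at or above the risk p.
            total += sum(wc * (1 - c) for wc, c in zip(w, grid) if c >= p) * h
    return 2.0 * total / len(probs)


if __name__ == "__main__":
    probs = [0.1, 0.4, 0.7, 0.9]
    outcomes = [0, 1, 0, 1]
    print(brier_score(probs, outcomes))                           # classic score
    print(weighted_brier_score(probs, outcomes, beta_pdf(1, 1)))  # uniform weight
    print(weighted_brier_score(probs, outcomes, beta_pdf(2, 8)))  # favors low thresholds
```

A weight such as Beta(2, 8) concentrates on low thresholds, which would suit a setting where missing a case is considered far more costly than an unnecessary intervention; the appropriate weight in practice depends on the clinical decision at hand.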