Accounting for Heavy Censoring in Evaluating the Risk Stratification Abilities of Existing Models for Time to Diagnosis of Huntington Disease

Huntington disease (HD) is a neurodegenerative disease with progressively worsening symptoms. Accurately modeling time to HD diagnosis is essential for clinical trial design. Langbehn's model, the CAG-Age Product (CAP) model, the Prognostic Index Normed (PIN) model, and the Multivariate Risk Score (MRS) model have all been proposed for this task. However, these models may yield conflicting predictions, and few studies have systematically compared their performance. Further, the comparisons that exist could be misleading because they tested the models on the same data used to train them and used performance metrics that do not account for high rates of right censoring (80%+). We discuss the theoretical foundations of these models, offering intuitive comparisons of their practical feasibility. We externally validate their risk stratification abilities using data from the ENROLL-HD study and two censoring-appropriate performance metrics, guiding model selection for HD clinical trial design. Because these models were developed in HD studies that ended more than a decade ago, we compare their predictive performance using published parameters versus updated ones (re-estimated using ENROLL-HD). We also show how these models can be used to estimate sample sizes for an HD clinical trial. Based on either metric and using published or updated parameters, the MRS model, which incorporates the most covariates, performed best. However, the simpler PIN model offered similarly good performance while requiring fewer variables; many of the additional variables in the MRS model would require patients to undergo additional tests. In illustrating an HD clinical trial design, we define an optimal threshold based on model performance metrics to determine which patients are more likely to be diagnosed. We show that sample size calculations using an optimal threshold based on metrics that do not account for censoring, as in previous studies, lead to underpowered trials.
