Scoring rules promote rational and honest decision-making, which is becoming increasingly important for automated procedures in 'auto-ML'. In this paper we survey common squared and logarithmic scoring rules for survival analysis and determine which are proper and which are improper. We prove that commonly used squared and logarithmic scoring rules that are claimed to be proper, such as the Integrated Survival Brier Score (ISBS), are in fact improper. We further prove that, under a strict set of assumptions, a class of scoring rules is strictly proper for what we term 'approximate' survival losses. Despite the difference in properness, experiments on simulated and real-world datasets show no major difference between the improper and proper versions of the widely used ISBS, suggesting that we can reasonably trust previous experiments that used the original score for evaluation. We still advocate the use of proper scoring rules, as even minor differences between losses can have important implications in automated processes such as model tuning. We hope our findings encourage further research into the properties of survival measures so that robust and honest evaluation of survival models can be achieved.