Evaluating rare-event forecasts is challenging because standard metrics collapse as event prevalence declines. Measures such as F1-score, AUPRC, MCC, and accuracy induce degenerate thresholds -- converging to zero or one -- and their values become dominated by class imbalance rather than tail discrimination. We develop a family of rare-event-stable (RES) metrics whose optimal thresholds remain strictly interior as the event probability approaches zero, ensuring coherent decision rules under extreme rarity. Simulations spanning event probabilities from 0.01 down to one in a million show that RES metrics maintain stable thresholds, consistent model rankings, and near-complete prevalence invariance, whereas traditional metrics exhibit statistically significant threshold drift and structural collapse. A credit-default application confirms these results: RES metrics yield interpretable probability-of-default cutoffs (4-9%) and remain robust under subsampling, while classical metrics fail operationally. The RES framework provides a principled, prevalence-invariant basis for evaluating extreme-risk forecasts.
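The threshold degeneracy described for F1 can be illustrated with a minimal sketch. It does not implement the RES metrics, whose definition is not given in this abstract, and it is not the simulation design used in the paper. It assumes a hypothetical binormal score model (negatives drawn from N(0, 1), positives from N(2, 1)); the function names and the separation value of 2 are illustrative choices. For each prevalence it finds the F1-maximizing score cutoff on a grid and converts it to a calibrated event probability. Under calibration, the F1-optimal probability cutoff equals half the maximal F1, so it shrinks toward zero as the event becomes rarer.

```python
import math

def upper_tail(x):
    """P(Z > x) for a standard normal Z, computed stably in the far tail."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def f1_optimal_cutoff(prevalence, separation=2.0, step=0.001):
    """Return (max F1, calibrated probability at the F1-optimal cutoff).

    Assumed model (illustrative only): negative scores ~ N(0, 1),
    positive scores ~ N(separation, 1).
    """
    lo, hi = -5.0, 10.0
    best_f1, best_t = -1.0, lo
    for i in range(int(round((hi - lo) / step)) + 1):
        t = lo + i * step
        tpr = upper_tail(t - separation)   # true positive rate at cutoff t
        fpr = upper_tail(t)                # false positive rate at cutoff t
        # F1 = 2TP / (2TP + FP + FN), expressed with rates and prevalence
        f1 = 2.0 * prevalence * tpr / (
            prevalence * (tpr + 1.0) + (1.0 - prevalence) * fpr
        )
        if f1 > best_f1:
            best_f1, best_t = f1, t
    # Posterior event probability at score best_t via the likelihood ratio
    log_lr = separation * best_t - separation ** 2 / 2.0
    odds = prevalence / (1.0 - prevalence) * math.exp(log_lr)
    return best_f1, odds / (1.0 + odds)

prevalences = [1e-2, 1e-3, 1e-4, 1e-5, 1e-6]
results = {p: f1_optimal_cutoff(p) for p in prevalences}

for p in prevalences:
    f1, cutoff = results[p]
    print(f"prevalence={p:.0e}  max F1={f1:.4f}  probability cutoff={cutoff:.5f}")
```

With discrimination held fixed, the printed probability cutoff falls steadily as prevalence drops from 0.01 to one in a million, which is the kind of drift toward a boundary that the abstract attributes to classical metrics.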