Machine learning classification methods usually assume that all possible classes are sufficiently represented in the training set. Because extreme events are inherently rare, they are always under-represented, and classifiers tailored to predicting extremes must be carefully designed to handle this under-representation. In this paper, we address the question of how to assess and compare classifiers with respect to their capacity to capture extreme occurrences. This question is also related to the scoring rules used in the forecasting literature. In this context, we propose and study a risk function adapted to extremal classifiers. The inferential properties of our empirical risk estimator are derived under the framework of multivariate regular variation and hidden regular variation. A simulation study compares different classifiers and assesses their performance with respect to our risk function. To conclude, we apply our framework to the analysis of extreme river discharges in the Danube river basin. The application compares different predictive algorithms and tests their capacity to forecast river discharges from other river stations.