A User-Focused Approach to Evaluating Probabilistic and Categorical Forecasts

We demonstrate a user-focused verification approach for evaluating probability forecasts of binary outcomes (also known as probabilistic classifiers) that (i) is based on proper scoring rules, (ii) focuses on user decision thresholds, and (iii) provides actionable insights. We argue that the widespread use of categorical performance diagrams and the critical success index to evaluate probabilistic forecasts may produce misleading results, and we illustrate how Murphy diagrams are better suited to understanding performance across user decision thresholds. The use of proper scoring rules that account for the relative importance of different user decision thresholds is shown to affect scores of overall performance, as well as supporting measures of discrimination and calibration. These methods are demonstrated by evaluating several probabilistic thunderstorm forecast systems. Furthermore, we illustrate an approach that allows a fair comparison between continuous probabilistic forecasts and categorical outlooks using the FIxed Risk Multicategorical (FIRM) score, and we establish the relationship between the FIRM score and Murphy diagrams. The results highlight how the relative performance of thunderstorm forecasts produced for tropical Australian waters by operational meteorologists and by an automated system depends on which decision thresholds a user acts on. A hindcast of a new automated system is shown to generally perform better than both the meteorologists and the old automated system across tropical Australian waters. While the methods are illustrated using thunderstorm forecasts, they are applicable to evaluating probabilistic forecasts in any situation with binary outcomes.
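To make the idea of a Murphy diagram concrete, the following is a minimal sketch of the elementary score for a probability forecast of a binary outcome, evaluated over a range of user decision thresholds. It is an illustration under stated assumptions rather than the implementation used in this study: the function names, the treatment of ties (a forecast equal to the threshold counts as "do not act"), and the four forecast-observation pairs are all hypothetical. Under this convention, a false alarm at threshold theta costs theta, a miss costs 1 - theta, and twice the integral of the mean elementary score over theta recovers the mean Brier score.

```python
import numpy as np


def elementary_score(prob, obs, theta):
    """Elementary score at decision threshold theta (assumed convention).

    A user acts when prob > theta. A false alarm costs theta and a miss
    costs 1 - theta; correct decisions cost nothing.
    """
    prob = np.asarray(prob, dtype=float)
    obs = np.asarray(obs, dtype=int)
    false_alarm = (obs == 0) & (prob > theta)
    miss = (obs == 1) & (prob <= theta)
    return theta * false_alarm + (1.0 - theta) * miss


def murphy_curve(prob, obs, thetas):
    """Mean elementary score at each threshold: the curve drawn in a Murphy diagram."""
    return np.array([elementary_score(prob, obs, t).mean() for t in thetas])


# Hypothetical data for illustration only (not from the study).
prob = np.array([0.1, 0.4, 0.8, 0.6])
obs = np.array([0, 1, 1, 0])

# Midpoints of a fine, evenly spaced grid on (0, 1).
n = 10_000
thetas = (np.arange(n) + 0.5) / n
curve = murphy_curve(prob, obs, thetas)

# Twice the area under the curve approximates the mean Brier score.
brier_from_curve = 2.0 * curve.mean()
brier_direct = np.mean((prob - obs) ** 2)
```

Plotting `curve` against `thetas` for two or more forecast systems on the same axes gives a Murphy diagram: the system with the lower curve at a given threshold is the better choice for a user acting at that threshold.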
