The Brier score conflates two distinct properties of probabilistic predictions: reliability (calibration error) and resolution (discriminatory power). We introduce the Manokhin Probability Matrix, a BCG-style two-dimensional diagnostic framework that separates them. Classifiers are placed on a 2x2 grid by Spiegelhalter Z-statistic and AUC-ROC expected rank, then assigned to one of four archetypes: Eagle (good on both axes), Bull (strong discrimination, poor calibration), Sloth (well-calibrated, weak discriminator), and Mole (poor on both). Each archetype carries a distinct prescription. We populate the matrix from a large-scale empirical study spanning 21 classifiers, 5 post-hoc calibrators, and 30 real-world binary classification tasks from the TabArena-v0.1 suite. The assignment is unambiguous. CatBoost, TabICL, EBM, TabPFN, GBC, and Random Forest are Eagles. XGBoost, LightGBM, and HGB are Bulls; Venn-Abers calibration cuts log-loss by 6.5 to 12.6% on Bulls but degrades Eagles by 2.1%. SVM, LR, LDA, and the empirical base-rate predictor are Sloths. MLP, KNN, Naive Bayes, and ExtraTrees are Moles. A theoretical asymmetry follows: no order-preserving post-hoc calibrator can add discriminatory power (Proposition 1), so calibration is the fixable part and discrimination is the hard part. The practical rule is direct: do not optimise aggregate Brier score without first decomposing it; optimise discrimination first, then fix calibration post-hoc. Code and raw experimental data are available at https://github.com/valeman/classifier_calibration.
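The decomposition and the two matrix axes can be illustrated with a minimal sketch. It is not the authors' implementation (see the repository above for that). It assumes the classical Murphy decomposition of the Brier score, grouped by distinct forecast value so that the identity is exact, the standard form of the Spiegelhalter Z-statistic, and a pairwise (Mann-Whitney) AUC. The function names, the toy data, and the cutoffs `z_crit` and `auc_cut` are hypothetical; in particular, the paper ranks classifiers by AUC across tasks rather than applying a fixed AUC threshold.

```python
import math
from collections import defaultdict


def brier_score(y, p):
    """Mean squared difference between predicted probabilities and 0/1 outcomes."""
    return sum((pi - yi) ** 2 for yi, pi in zip(y, p)) / len(y)


def murphy_decomposition(y, p):
    """Return (reliability, resolution, uncertainty).

    Samples are grouped by distinct forecast value, so
    Brier = reliability - resolution + uncertainty holds exactly.
    """
    n = len(y)
    base_rate = sum(y) / n
    groups = defaultdict(list)
    for yi, pi in zip(y, p):
        groups[pi].append(yi)
    reliability = sum(
        len(ys) * (pk - sum(ys) / len(ys)) ** 2 for pk, ys in groups.items()
    ) / n
    resolution = sum(
        len(ys) * (sum(ys) / len(ys) - base_rate) ** 2 for ys in groups.values()
    ) / n
    uncertainty = base_rate * (1 - base_rate)
    return reliability, resolution, uncertainty


def spiegelhalter_z(y, p):
    """Spiegelhalter Z; approximately N(0, 1) if the forecasts are calibrated.

    Undefined (zero variance) when every forecast is 0, 0.5, or 1.
    """
    num = sum((yi - pi) * (1 - 2 * pi) for yi, pi in zip(y, p))
    var = sum((1 - 2 * pi) ** 2 * pi * (1 - pi) for pi in p)
    return num / math.sqrt(var)


def auc_roc(y, p):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [pi for yi, pi in zip(y, p) if yi == 1]
    neg = [pi for yi, pi in zip(y, p) if yi == 0]
    wins = sum((a > b) + 0.5 * (a == b) for a in pos for b in neg)
    return wins / (len(pos) * len(neg))


def archetype(z, auc, z_crit=1.96, auc_cut=0.80):
    """Assign a quadrant. Both cutoffs are illustrative assumptions, not the paper's."""
    calibrated = abs(z) < z_crit
    discriminating = auc >= auc_cut
    if discriminating:
        return "Eagle" if calibrated else "Bull"
    return "Sloth" if calibrated else "Mole"


# Toy data (hypothetical): outcomes and predicted probabilities.
y = [0, 0, 0, 1, 0, 1, 1, 1, 0, 1]
p = [0.1, 0.1, 0.3, 0.3, 0.3, 0.7, 0.7, 0.9, 0.9, 0.9]

bs = brier_score(y, p)
rel, res, unc = murphy_decomposition(y, p)
z = spiegelhalter_z(y, p)
auc = auc_roc(y, p)

# A strictly increasing map (here p -> p^2) stands in for a post-hoc recalibration:
# it changes reliability but leaves the ranking, and hence AUC and resolution, unchanged.
p_mapped = [pi ** 2 for pi in p]
rel_mapped, res_mapped, _ = murphy_decomposition(y, p_mapped)
auc_mapped = auc_roc(y, p_mapped)

print(f"Brier={bs:.4f}  REL={rel:.4f}  RES={res:.4f}  UNC={unc:.4f}")
print(f"Z={z:.3f}  AUC={auc:.3f}  archetype={archetype(z, auc)}")
print(f"after map: REL={rel_mapped:.4f}  RES={res_mapped:.4f}  AUC={auc_mapped:.3f}")
```

The last three lines of the demo mirror the asymmetry stated in Proposition 1 for the special case of a strictly increasing map: the reliability term moves, while resolution and AUC stay fixed. Maps that merge distinct scores (for example, isotonic regression with plateaus) introduce ties and are not covered by this toy check.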