The Manokhin Probability Matrix: A Diagnostic Framework for Classifier Probability Quality

The Brier score conflates two distinct properties of probabilistic predictions: reliability (calibration error) and resolution (discriminatory power). We introduce the Manokhin Probability Matrix, a BCG-style two-dimensional diagnostic framework that separates them. Classifiers are placed on a 2x2 grid by Spiegelhalter Z-statistic and AUC-ROC expected rank, then assigned to one of four archetypes: Eagle (good on both axes), Bull (strong discrimination, poor calibration), Sloth (well-calibrated, weak discriminator), and Mole (poor on both). Each archetype carries a distinct prescription. We populate the matrix from a large-scale empirical study spanning 21 classifiers, 5 post-hoc calibrators, and 30 real-world binary classification tasks from the TabArena-v0.1 suite. The assignment is unambiguous. CatBoost, TabICL, EBM, TabPFN, GBC, and Random Forest are Eagles. XGBoost, LightGBM, and HGB are Bulls; Venn-Abers calibration cuts log-loss by 6.5 to 12.6% on Bulls but degrades Eagles by 2.1%. SVM, LR, LDA, and the empirical base-rate predictor are Sloths. MLP, KNN, Naive Bayes, and ExtraTrees are Moles. A theoretical asymmetry follows: no order-preserving post-hoc calibrator can add discriminatory power (Proposition 1), so calibration is the fixable part and discrimination is the hard part. The practical rule is direct: do not optimise aggregate Brier score without first decomposing it; optimise discrimination first, then fix calibration post-hoc. Code and raw experimental data are available at https://github.com/valeman/classifier_calibration.
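The two axes and the asymmetry in Proposition 1 can be illustrated with a minimal sketch. It is not the code from the linked repository: the function names, the synthetic data, and the cutoffs in `archetype` (`z_cut`, `auc_cut`) are assumptions made for illustration, and a raw AUC threshold stands in for the expected-rank axis used in the study. The sketch computes the Spiegelhalter Z-statistic in its standard form and shows that a strictly increasing distortion of well-calibrated probabilities leaves AUC-ROC unchanged while pushing Z far from zero.

```python
import numpy as np

def spiegelhalter_z(y, p):
    """Spiegelhalter Z-statistic; roughly N(0, 1) when p is calibrated."""
    y = np.asarray(y, dtype=float)
    p = np.asarray(p, dtype=float)
    num = np.sum((y - p) * (1.0 - 2.0 * p))
    den = np.sqrt(np.sum((1.0 - 2.0 * p) ** 2 * p * (1.0 - p)))
    return num / den

def auc_roc(y, p):
    """AUC-ROC as the fraction of (positive, negative) pairs ranked correctly."""
    y = np.asarray(y)
    p = np.asarray(p, dtype=float)
    diff = p[y == 1][:, None] - p[y == 0][None, :]
    # Ties count as half a correctly ordered pair.
    return (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size

def archetype(z, auc, z_cut=1.96, auc_cut=0.8):
    """Hypothetical quadrant rule; both cutoffs are illustrative assumptions."""
    calibrated = abs(z) < z_cut
    strong = auc >= auc_cut
    if calibrated and strong:
        return "Eagle"
    if strong:
        return "Bull"
    if calibrated:
        return "Sloth"
    return "Mole"

# Synthetic example: labels are drawn from p, so p is calibrated by construction.
rng = np.random.default_rng(0)
s = rng.normal(size=2000)                   # latent scores
p = 1.0 / (1.0 + np.exp(-s))                # calibrated probabilities
y = (rng.random(2000) < p).astype(int)      # labels drawn from p
q = 1.0 / (1.0 + np.exp(-3.0 * s))          # overconfident, same ordering as p

print("calibrated   :", spiegelhalter_z(y, p), auc_roc(y, p))
print("overconfident:", spiegelhalter_z(y, q), auc_roc(y, q))
```

Because `q` is a strictly increasing function of `p`, every positive-negative pair keeps its order, so the two AUC values coincide while only the Z-statistic moves. Read in reverse, the same observation is the intuition behind Proposition 1: a post-hoc calibrator that preserves order can undo the distortion in `q`, but it cannot reorder cases and therefore cannot improve discrimination.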
