Many machine learning tasks admit multiple models that perform almost equally well, a phenomenon known as predictive multiplicity. A fundamental source of this multiplicity is observational multiplicity, which arises from the stochastic nature of label collection: the observed training labels represent only a single realization of the underlying ground-truth probabilities. While theoretical frameworks for observational multiplicity have been established for logistic regression, their implications for non-smooth, partition-based models such as decision trees remain underexplored. In this paper, we introduce two complementary notions of observational multiplicity for decision tree classifiers: leaf regret and structural regret. Leaf regret quantifies the intrinsic variability of predictions within a fixed leaf due to finite-sample noise, while structural regret captures the variability induced by the instability of the learned tree structure itself. We provide a formal decomposition of observational multiplicity into these two components and establish statistical guarantees. Our experimental evaluation across diverse credit risk scoring datasets confirms the near-perfect alignment between our theoretical decomposition and the empirically observed variance. Notably, we find that structural regret is the primary driver of observational multiplicity, accounting for over 15 times the variability of leaf regret in some datasets. Furthermore, we demonstrate that using these regret measures as an abstention mechanism in selective prediction can effectively identify regions where predictions are arbitrary and improve model safety, raising recall from 92% to 100% on the most stable sub-populations. These results establish a rigorous framework for quantifying observational multiplicity, aligning with recent advances in algorithmic safety and interpretability.
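The decomposition described above can be illustrated with a toy simulation (a minimal sketch, not the paper's actual method: `true_p`, `fit_stump`, the threshold grid, and the query point are all hypothetical choices). The idea is that, over repeated resamplings of the labels from fixed ground-truth probabilities, the variance of a point's predicted probability splits exactly, via the law of total variance, into a within-structure term (leaf regret) and a between-structure term (structural regret).

```python
import random
from statistics import mean, pvariance

random.seed(0)

def true_p(x):
    # Hypothetical ground-truth label probability: a step at x = 0.5.
    return 0.2 if x < 0.5 else 0.8

def fit_stump(xs, ys):
    """Fit a depth-1 decision stump by minimizing weighted Gini impurity.
    Returns (threshold, left_leaf_mean, right_leaf_mean)."""
    def gini(labels):
        p = mean(labels)
        return p * (1 - p)
    best = None
    for t in [i / 20 for i in range(1, 20)]:  # illustrative threshold grid
        left = [y for x, y in zip(xs, ys) if x < t]
        right = [y for x, y in zip(xs, ys) if x >= t]
        if not left or not right:
            continue
        score = len(left) * gini(left) + len(right) * gini(right)
        if best is None or score < best[0]:
            best = (score, t, mean(left), mean(right))
    _, t, ml, mr = best
    return t, ml, mr

def predict(model, x):
    t, ml, mr = model
    return ml if x < t else mr

xs = [i / 200 for i in range(200)]
x_test = 0.25  # query point well inside one region

# Resample labels from the fixed ground-truth probabilities and refit,
# grouping the query point's predictions by the learned structure (threshold).
preds_by_structure = {}
for _ in range(500):
    ys = [1 if random.random() < true_p(x) else 0 for x in xs]
    model = fit_stump(xs, ys)
    preds_by_structure.setdefault(model[0], []).append(predict(model, x_test))

all_preds = [p for ps in preds_by_structure.values() for p in ps]
total_var = pvariance(all_preds)
grand_mean = mean(all_preds)

# Law of total variance: Var = E[Var | structure] + Var(E[. | structure]).
leaf_regret = 0.0        # within-structure variability
structural_regret = 0.0  # between-structure variability
for t, ps in preds_by_structure.items():
    w = len(ps) / len(all_preds)
    leaf_regret += w * pvariance(ps)
    structural_regret += w * (mean(ps) - grand_mean) ** 2

print(f"total variance     : {total_var:.5f}")
print(f"leaf regret term   : {leaf_regret:.5f}")
print(f"structural term    : {structural_regret:.5f}")
assert abs(total_var - (leaf_regret + structural_regret)) < 1e-9
```

The two terms sum exactly to the total variance, mirroring the formal decomposition; in this toy setup, how the two terms compare depends on how often the fitted threshold moves across the query point.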