Beyond Point Estimates: Distributional Uncertainty in Machine Learning Performance Evaluation

Machine learning models are often evaluated using point estimates of performance metrics such as accuracy, F1 score, or mean squared error. Such summaries fail to capture the variability induced by stochastic elements of the training process, including data splitting, initialization, and hyperparameter optimization. This work proposes a distributional perspective on model evaluation that treats performance metrics as random quantities rather than fixed values. Instead of focusing solely on aggregate measures, it analyzes the empirical distributions of performance metrics using quantiles and their confidence intervals. The study investigates point and interval estimation of quantiles in real-data use cases for classification and regression tasks, complemented by simulation studies for validation. Special emphasis is placed on small sample sizes, reflecting a practical constraint in machine learning, where repeated training is computationally expensive. The results show that meaningful statistical inference on the underlying performance distribution is feasible even with sample sizes in the range of 10 to 25, and that standard nonparametric confidence intervals remain applicable under these conditions. The proposed approach characterizes variability and uncertainty in more detail than mean-based evaluation and enables a more differentiated comparison of models. In particular, it supports a risk-oriented interpretation of model performance, which is relevant in applications where reliability is critical. The presented methods are easy to implement and broadly applicable, making them a practical extension to standard performance evaluation procedures in machine learning.
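As an illustration of the kind of nonparametric interval referred to above, the following is a minimal sketch of the textbook distribution-free confidence interval for a quantile, built from order statistics and the binomial distribution. It is not taken from the paper: the function names `binom_cdf` and `quantile_ci`, the equal-tailed choice of ranks, and the example scores are all assumptions made for illustration. For a continuous performance distribution, the number of runs at or below the true p-quantile follows Binomial(n, p), so the interval between the l-th and u-th smallest scores covers the quantile with probability at least P(l <= B <= u - 1).

```python
from math import comb


def binom_cdf(k, n, p):
    """Return P(B <= k) for B ~ Binomial(n, p)."""
    if k < 0:
        return 0.0
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(min(k, n) + 1))


def quantile_ci(scores, p, alpha=0.05):
    """Distribution-free CI for the p-quantile from order statistics.

    Returns (lower, upper, coverage), or None when the sample is too
    small for a two-sided interval at this level (illustrative choice).
    """
    xs = sorted(scores)
    n = len(xs)
    # Lower rank: largest l in 1..n with P(B <= l - 1) <= alpha / 2.
    lower = max(
        (l for l in range(1, n + 1) if binom_cdf(l - 1, n, p) <= alpha / 2),
        default=None,
    )
    # Upper rank: smallest u in 1..n with P(B >= u) <= alpha / 2.
    upper = min(
        (u for u in range(1, n + 1) if 1 - binom_cdf(u - 1, n, p) <= alpha / 2),
        default=None,
    )
    if lower is None or upper is None:
        return None
    # Guaranteed coverage: P(lower <= B <= upper - 1).
    coverage = binom_cdf(upper - 1, n, p) - binom_cdf(lower - 1, n, p)
    return xs[lower - 1], xs[upper - 1], coverage


# Hypothetical accuracies from 10 repeated training runs (made-up values).
accuracies = [0.91, 0.88, 0.93, 0.90, 0.89, 0.92, 0.87, 0.94, 0.90, 0.91]
print(quantile_ci(accuracies, p=0.5))
print(quantile_ci(accuracies, p=0.9))  # None: 10 runs cannot bound the 0.9 quantile from above
```

With 10 runs and a nominal 95% level, the median interval in this sketch spans the 2nd to the 9th smallest score, with a guaranteed coverage of about 0.979. The second call shows the small-sample limitation directly: for an extreme quantile no order statistic can serve as an upper bound, so the sketch returns `None` rather than an interval with insufficient coverage.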
