Beyond Point Estimates: Reliable Evaluation of Prediction Performance Metrics under Clustered Data

Prediction performance metrics such as accuracy and the F1 score are typically reported as single numbers, with no measure of uncertainty. This omission has been tolerable in exploratory settings, where model evaluation serves informal comparison rather than formal decision-making. But as machine learning is deployed in real-world applications, evaluation results are increasingly used to support binary decisions (whether or not a model meets a required standard), which makes uncertainty quantification essential. The problem is compounded when data are dependent, as in repeated measurements, clustered subjects, or time series, where variability is harder to assess and easy to underestimate. We develop a unified framework that covers a broad class of performance metrics by representing them as smooth functionals of confusion-matrix probabilities. This representation allows the cluster-robust sandwich variance estimator to be used to obtain asymptotically valid confidence intervals, hypothesis tests, and paired model comparisons for both binary and multiclass problems under clustered data. We also provide power and sample size approximations based on pilot data, enabling principled study design for model evaluation. Simulations show that the proposed methods achieve near-nominal coverage across a range of dependence structures, whereas naive methods underestimate variability. A real-data application further illustrates how accounting for clustering can materially change conclusions. These results offer a practical foundation for uncertainty quantification and study design in prediction performance evaluation when decisions must be justified from dependent or clustered data.
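
To make the central idea concrete, the following is a minimal sketch, not the paper's implementation, of how a metric written as a smooth function of confusion-matrix probabilities can be paired with a cluster-robust (sandwich-type) variance for the binary case. All function names are hypothetical. The small-sample factor G/(G-1), the first-order (delta-method) linearization, and the normal critical value are assumptions made for illustration only.

```python
import numpy as np


def confusion_indicators(y_true, y_pred):
    """One-hot rows marking each observation's confusion cell: TP, FP, FN, TN."""
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(y_pred).astype(bool)
    return np.column_stack([
        y_true & y_pred,    # true positive
        ~y_true & y_pred,   # false positive
        y_true & ~y_pred,   # false negative
        ~y_true & ~y_pred,  # true negative
    ]).astype(float)


def cluster_robust_cov(z, clusters):
    """Estimated cell probabilities and their cluster-robust covariance."""
    n = z.shape[0]
    p_hat = z.mean(axis=0)
    _, idx = np.unique(np.asarray(clusters).ravel(), return_inverse=True)
    n_clusters = idx.max() + 1
    # Sum the centered indicators within each cluster.
    sums = np.zeros((n_clusters, z.shape[1]))
    np.add.at(sums, idx, z - p_hat)
    cov = sums.T @ sums / n**2
    # Assumed small-sample correction; other choices are possible.
    return p_hat, cov * n_clusters / (n_clusters - 1)


def accuracy(p):
    """Accuracy = P(TP) + P(TN), returned with its gradient in p."""
    return p[0] + p[3], np.array([1.0, 0.0, 0.0, 1.0])


def f1(p):
    """F1 = 2 TP / (2 TP + FP + FN), returned with its gradient in p."""
    tp, fp, fn, _ = p
    d = 2 * tp + fp + fn
    grad = np.array([2 * (fp + fn), -2 * tp, -2 * tp, 0.0]) / d**2
    return 2 * tp / d, grad


def metric_ci(y_true, y_pred, clusters, metric, z_crit=1.959964):
    """Point estimate, standard error, and Wald interval for a metric."""
    z = confusion_indicators(y_true, y_pred)
    p_hat, cov = cluster_robust_cov(z, clusters)
    value, grad = metric(p_hat)
    se = float(np.sqrt(grad @ cov @ grad))  # first-order linearization
    return float(value), se, (value - z_crit * se, value + z_crit * se)


if __name__ == "__main__":
    # Simulated data in which correctness varies by cluster (illustrative only).
    rng = np.random.default_rng(0)
    clusters = np.repeat(np.arange(40), 25)
    q = rng.beta(8, 2, size=40)[clusters]  # cluster-level P(correct)
    y_true = rng.integers(0, 2, size=clusters.size)
    y_pred = np.where(rng.random(clusters.size) < q, y_true, 1 - y_true)

    acc, se, ci = metric_ci(y_true, y_pred, clusters, accuracy)
    naive_se = np.sqrt(acc * (1 - acc) / clusters.size)
    print(f"accuracy={acc:.3f} cluster SE={se:.4f} naive SE={naive_se:.4f}")
    print("F1 (value, SE, CI):", metric_ci(y_true, y_pred, clusters, f1))
```

When errors are positively correlated within clusters, as in the simulated data here, the cluster-based standard error would be expected to exceed the naive binomial one, which is the underestimation the abstract warns about. Passing a different `metric` function that returns a value and its gradient is enough to cover other metrics of the same form.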
