Toward Generalizable Machine Learning Models in Speech, Language, and Hearing Sciences: Sample Size Estimation and Reducing Overfitting

This study's first purpose is to provide quantitative evidence that would incentivize researchers to adopt the more robust method of nested cross-validation. The second purpose is to present methods and MATLAB code for performing power analysis for machine learning (ML)-based analyses during the design of a study. Monte Carlo simulations were used to quantify the interactions between the employed cross-validation method, the discriminative power of features, the dimensionality of the feature space, and the dimensionality of the model. Four cross-validation methods (single holdout, 10-fold, train-validation-test, and nested 10-fold) were compared based on the statistical power and statistical confidence of the resulting ML models. Distributions of the null and alternative hypotheses were used to determine the minimum sample size required to obtain a statistically significant outcome (α = 0.05, 1 − β = 0.8). Statistical confidence of the model was defined as the probability that the correct features are selected and hence included in the final model. Our analysis showed that the model generated with the single holdout method had very low statistical power and statistical confidence and that it significantly overestimated the accuracy. Conversely, nested 10-fold cross-validation resulted in the highest statistical confidence and the highest statistical power, while providing an unbiased estimate of the accuracy. The required sample size with a single holdout could be 50% higher than what would be needed if nested cross-validation were used. Confidence in the model based on nested cross-validation was as much as four times higher than the confidence in the single holdout-based model. A computational model, MATLAB code, and lookup tables are provided to assist researchers with estimating the sample size during the design of their future studies.
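The simulation-based power analysis described above can be illustrated with a minimal sketch. It is written in Python rather than MATLAB and is not the authors' code. Every name in it (`simulate_dataset`, `estimate_power`, `minimum_sample_size`, and so on) is hypothetical, and the setup is an assumption made for illustration: Gaussian features of which only one is informative, a nearest-class-mean classifier on a single selected feature, and plain stratified 5-fold cross-validation with feature selection confined to the training folds, which is simpler than the nested 10-fold procedure studied in the paper.

```python
import random

def simulate_dataset(n_per_class, n_features, effect_size, rng):
    """Two balanced classes; only feature 0 is shifted (by effect_size) in class 1."""
    X, y = [], []
    for label in (0, 1):
        for _ in range(n_per_class):
            row = [rng.gauss(0.0, 1.0) for _ in range(n_features)]
            row[0] += effect_size * label
            X.append(row)
            y.append(label)
    return X, y

def stratified_folds(y, k, rng):
    """Deal the shuffled members of each class round-robin into k folds."""
    folds = [[] for _ in range(k)]
    for label in (0, 1):
        members = [i for i, v in enumerate(y) if v == label]
        rng.shuffle(members)
        for pos, i in enumerate(members):
            folds[pos % k].append(i)
    return folds

def fit_one_feature(X, y, train_idx):
    """Pick the feature with the largest class-mean gap, using training rows only."""
    best = (-1.0, 0, 0.0, 1.0)  # (gap, feature index, threshold, sign)
    for j in range(len(X[0])):
        v0 = [X[i][j] for i in train_idx if y[i] == 0]
        v1 = [X[i][j] for i in train_idx if y[i] == 1]
        m0, m1 = sum(v0) / len(v0), sum(v1) / len(v1)
        gap = abs(m1 - m0)
        if gap > best[0]:
            best = (gap, j, (m0 + m1) / 2.0, 1.0 if m1 >= m0 else -1.0)
    return best[1:]

def cv_accuracy(X, y, k, rng):
    """k-fold accuracy; the held-out fold never influences feature selection."""
    correct = 0
    for test_idx in stratified_folds(y, k, rng):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(y)) if i not in held_out]
        j, threshold, sign = fit_one_feature(X, y, train_idx)
        for i in test_idx:
            pred = 1 if sign * (X[i][j] - threshold) > 0 else 0
            correct += int(pred == y[i])
    return correct / len(y)

def accuracy_distribution(n_per_class, effect_size, n_features, k, reps, rng):
    """Monte Carlo sample of cross-validated accuracies, sorted ascending."""
    accs = []
    for _ in range(reps):
        X, y = simulate_dataset(n_per_class, n_features, effect_size, rng)
        accs.append(cv_accuracy(X, y, k, rng))
    return sorted(accs)

def estimate_power(n_per_class, effect_size, n_features=5, k=5,
                   reps=200, alpha=0.05, seed=0):
    """Power = share of alternative-hypothesis accuracies above the null critical value."""
    rng = random.Random(seed)
    null = accuracy_distribution(n_per_class, 0.0, n_features, k, reps, rng)
    alt = accuracy_distribution(n_per_class, effect_size, n_features, k, reps, rng)
    # Empirical (1 - alpha) quantile of the null distribution.
    critical = null[min(reps - 1, round((1 - alpha) * reps))]
    power = sum(a > critical for a in alt) / reps
    return power, critical

def minimum_sample_size(effect_size, candidates, target_power=0.8, **kwargs):
    """Smallest candidate per-class sample size reaching target_power, else None."""
    for n in candidates:
        power, _ = estimate_power(n, effect_size, **kwargs)
        if power >= target_power:
            return n
    return None

if __name__ == "__main__":
    # Illustrative call; the effect size and candidate sizes are arbitrary choices.
    print(minimum_sample_size(1.0, [10, 20, 40, 80], reps=100))
```

The null distribution is simulated with an effect size of zero, so the critical accuracy already absorbs any optimism introduced by the cross-validation scheme. Swapping in a different scheme inside `cv_accuracy` is how the methods could be compared under this sketch, although the numbers it produces should not be expected to match the paper's lookup tables.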
