Toward Generalizable Machine Learning Models in Speech, Language, and Hearing Sciences: Estimating Sample Size and Reducing Overfitting

This study's first purpose is to provide quantitative evidence that would incentivize researchers to adopt the more robust method of nested cross-validation. The second purpose is to present methods and MATLAB code for conducting power analysis for machine learning (ML)-based analyses during the design of a study. Monte Carlo simulations were used to quantify the interactions between the employed cross-validation method, the discriminative power of features, the dimensionality of the feature space, and the dimensionality of the model. Four cross-validation methods (single holdout, 10-fold, train-validation-test, and nested 10-fold) were compared based on the statistical power and statistical confidence of the resulting ML models. Distributions of the null and alternative hypotheses were used to determine the minimum sample size required to obtain a statistically significant outcome (α = 0.05, 1 − β = 0.8). Statistical confidence of the model was defined as the probability that the correct features are selected and hence included in the final model. Our analysis showed that the model generated with the single holdout method had very low statistical power and statistical confidence and that it significantly overestimated the accuracy. Conversely, nested 10-fold cross-validation resulted in the highest statistical confidence and the highest statistical power, while providing an unbiased estimate of the accuracy. The required sample size with a single holdout could be 50% higher than what would be needed if nested cross-validation were used. Confidence in the model based on nested cross-validation was as much as four times higher than confidence in the single holdout-based model. A computational model, MATLAB code, and lookup tables are provided to assist researchers with estimating the sample size during the design of their future studies.
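The sample-size procedure described above (simulate the null and alternative distributions of accuracy, then find the smallest sample that reaches 1 − β = 0.8 at α = 0.05) can be illustrated with a minimal Monte Carlo sketch. This is not the authors' MATLAB code. It is written in Python, and it makes several simplifying assumptions that are not taken from the study: a single Gaussian feature, a nearest-centroid classifier, plain 5-fold (not nested) cross-validation, and a class separation expressed as a standardized effect size. All function and parameter names are hypothetical.

```python
import random


def simulate_cv_accuracy(n_per_class, effect_size, n_folds, rng):
    """Simulate one study and return its k-fold cross-validated accuracy.

    Assumption (not from the paper): one feature, class 0 ~ N(0, 1),
    class 1 ~ N(effect_size, 1), nearest-centroid classifier.
    Requires n_per_class >= 2 and n_folds >= 3 so that every training
    fold contains both classes.
    """
    data = [(rng.gauss(0.0, 1.0), 0) for _ in range(n_per_class)]
    data += [(rng.gauss(effect_size, 1.0), 1) for _ in range(n_per_class)]
    rng.shuffle(data)

    correct = 0
    for fold in range(n_folds):
        # Class centroids are estimated from the training folds only.
        sums, counts = [0.0, 0.0], [0, 0]
        for i, (x, y) in enumerate(data):
            if i % n_folds != fold:
                sums[y] += x
                counts[y] += 1
        centroids = [sums[0] / counts[0], sums[1] / counts[1]]
        # Samples whose index is congruent to `fold` form the test fold.
        for x, y in data[fold::n_folds]:
            pred = 0 if abs(x - centroids[0]) <= abs(x - centroids[1]) else 1
            correct += pred == y
    return correct / len(data)


def estimate_power(n_per_class, effect_size, n_sims=400, alpha=0.05,
                   n_folds=5, seed=0):
    """Return (power, critical_accuracy) estimated by Monte Carlo."""
    rng = random.Random(seed)
    # Null distribution: the feature carries no class information.
    null = sorted(simulate_cv_accuracy(n_per_class, 0.0, n_folds, rng)
                  for _ in range(n_sims))
    # Accuracy exceeded by at most roughly alpha of the null simulations.
    critical = null[min(n_sims - 1, int((1 - alpha) * n_sims))]
    # Alternative distribution: the feature separates the two classes.
    alt = [simulate_cv_accuracy(n_per_class, effect_size, n_folds, rng)
           for _ in range(n_sims)]
    power = sum(a > critical for a in alt) / n_sims
    return power, critical


def minimum_sample_size(effect_size, candidates, target_power=0.8, **kwargs):
    """Return (n_per_class, power) for the first candidate reaching the
    target power, or (None, None) if no candidate does."""
    for n in candidates:
        power, _ = estimate_power(n, effect_size, **kwargs)
        if power >= target_power:
            return n, power
    return None, None


if __name__ == "__main__":
    n, power = minimum_sample_size(1.0, [10, 20, 40, 80, 160])
    print(f"smallest tested n per class: {n}, estimated power: {power:.2f}")
```

Swapping the inner loop for a different splitting scheme (for example, a single holdout or a nested loop with feature selection) would, under the same assumptions, allow the required sample sizes of the schemes to be compared in the way the abstract describes.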
