Toward Generalizable Machine Learning Models in Speech, Language, and Hearing Sciences: Power Analysis and Sample Size Estimation

This study's first purpose is to provide quantitative evidence that would incentivize researchers to use the more robust method of nested cross-validation. The second purpose is to present methods and MATLAB code for performing power analysis for ML-based analyses during the design of a study. Monte Carlo simulations were used to quantify the interactions between the employed cross-validation method, the discriminative power of features, the dimensionality of the feature space, and the dimensionality of the model. Four cross-validation methods (single holdout, 10-fold, train-validation-test, and nested 10-fold) were compared based on the statistical power and statistical confidence of the resulting ML models. Distributions of the null and alternative hypotheses were used to determine the minimum sample size required to obtain a statistically significant outcome (α = 0.05, 1 − β = 0.8). Statistical confidence of the model was defined as the probability that the correct features are selected and hence included in the final model. Our analysis showed that the model generated with the single holdout method had very low statistical power and statistical confidence, and that it significantly overestimated the accuracy. Conversely, nested 10-fold cross-validation resulted in the highest statistical confidence and the highest statistical power, while providing an unbiased estimate of the accuracy. The required sample size with a single holdout could be 50% higher than what would be needed if nested cross-validation were used. Confidence in the model based on nested cross-validation was as much as four times higher than the confidence in the single holdout-based model. A computational model, MATLAB code, and lookup tables are provided to assist researchers with estimating the sample size during the design of their future studies.
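The idea of reading a minimum sample size off the null and alternative distributions can be illustrated with a minimal Monte Carlo sketch. It is written in Python and is not the authors' MATLAB code. It rests on strong simplifying assumptions that the study itself does not make: each test sample is classified correctly with a fixed probability, so the test accuracy is binomially distributed, and the effects of the cross-validation scheme, feature selection, and model dimensionality are ignored. The function names, the chance level of 0.5, and the number of simulations are hypothetical choices made for illustration.

```python
import random


def simulate_accuracies(n, true_acc, n_sims, rng):
    """Simulate n_sims test accuracies on n samples, where each sample is
    classified correctly with probability true_acc (an assumption)."""
    return [
        sum(rng.random() < true_acc for _ in range(n)) / n
        for _ in range(n_sims)
    ]


def estimate_power(n, alt_acc, chance=0.5, alpha=0.05, n_sims=2000, seed=0):
    """Estimate the power of a one-sided test of 'accuracy > chance'."""
    rng = random.Random(seed)
    # Null distribution: accuracies of a classifier performing at chance.
    null = sorted(simulate_accuracies(n, chance, n_sims, rng))
    # Critical value: empirical (1 - alpha) quantile of the null distribution.
    crit = null[min(n_sims - 1, int((1 - alpha) * n_sims))]
    # Alternative distribution: accuracies of a classifier with true accuracy alt_acc.
    alt = simulate_accuracies(n, alt_acc, n_sims, rng)
    # Power: fraction of alternative outcomes that exceed the critical value.
    return sum(a > crit for a in alt) / n_sims


def min_sample_size(alt_acc, target_power=0.8, n_min=10, n_max=500, step=2):
    """Return the smallest scanned n whose estimated power reaches the target,
    or None if no n up to n_max does."""
    for n in range(n_min, n_max + 1, step):
        if estimate_power(n, alt_acc) >= target_power:
            return n
    return None


if __name__ == "__main__":
    # Hypothetical scenario: the model's true accuracy is 0.70.
    print("estimated minimum sample size:", min_sample_size(0.70))
```

Because accuracy on a finite test set takes only discrete values, the estimated power does not increase perfectly smoothly with the sample size, so the scan reports the first sample size that reaches the target rather than a guaranteed threshold. A faithful analysis would replace `simulate_accuracies` with a full simulation of the chosen cross-validation procedure, as the study describes.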
