Variability and biases in benchmarking the real-world performance of deep learning (DL) models for medical imaging compromise their trustworthiness for deployment. The common approach of holding out a single fixed test set fails to quantify the variance in the estimated test performance metrics. This study introduces NACHOS (Nested and Automated Cross-validation and Hyperparameter Optimization using Supercomputing) to reduce and quantify the variance of these estimates. NACHOS integrates Nested Cross-Validation (NCV) and Automated Hyperparameter Optimization (AHPO) within a parallelized high-performance computing (HPC) framework. NACHOS was demonstrated on a chest X-ray repository and an Optical Coherence Tomography (OCT) dataset under multiple data partitioning schemes. Beyond performance estimation, DACHOS (Deployment with Automated Cross-validation and Hyperparameter Optimization using Supercomputing) is introduced; it uses AHPO and cross-validation to build the final model on the full dataset, improving expected deployment performance. The findings underscore the importance of NCV in quantifying and reducing estimation variance, of AHPO in optimizing hyperparameters consistently across test folds, and of HPC in ensuring computational feasibility. By integrating these methodologies, NACHOS and DACHOS provide a scalable, reproducible, and trustworthy framework for DL model evaluation and deployment in medical imaging.
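To make the nested cross-validation idea concrete, the following is a minimal sketch and not the NACHOS implementation. It assumes a toy 1-D k-nearest-neighbour classifier in place of a deep learning model, a small grid search in place of AHPO, and serial execution in place of HPC parallelism. All function names (`k_folds`, `select_k`, `nested_cv`, `make_toy_data`) and the synthetic data are hypothetical. The final step loosely mirrors the DACHOS idea of selecting hyperparameters by cross-validation on the full dataset before building the deployment model.

```python
import random
import statistics


def k_folds(n_items, n_folds):
    """Split indices 0..n_items-1 into n_folds disjoint, near-equal folds."""
    return [list(range(i, n_items, n_folds)) for i in range(n_folds)]


def knn_predict(train, x, k):
    """Toy 1-D k-nearest-neighbour classifier; k is the hyperparameter."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if 2 * votes > len(nearest) else 0


def accuracy(train, test, k):
    """Fraction of test points whose label is predicted correctly."""
    return sum(knn_predict(train, x, k) == y for x, y in test) / len(test)


def select_k(data, candidates, n_folds):
    """Inner loop: pick k by mean validation accuracy over folds of `data`."""
    folds = k_folds(len(data), n_folds)

    def cv_score(k):
        scores = []
        for val_idx in folds:
            held_out = set(val_idx)
            train = [p for i, p in enumerate(data) if i not in held_out]
            val = [data[i] for i in val_idx]
            scores.append(accuracy(train, val, k))
        return statistics.mean(scores)

    return max(candidates, key=cv_score)


def nested_cv(data, candidates, outer_folds=5, inner_folds=4):
    """Outer loop: every fold is the test set once; k is re-selected per fold."""
    test_scores = []
    for test_idx in k_folds(len(data), outer_folds):
        held_out = set(test_idx)
        dev = [p for i, p in enumerate(data) if i not in held_out]
        test = [data[i] for i in test_idx]
        best_k = select_k(dev, candidates, inner_folds)  # never sees `test`
        test_scores.append(accuracy(dev, test, best_k))
    return test_scores


def make_toy_data(n=200, seed=0):
    """Synthetic data (an assumption): two Gaussian classes with means 0 and 2."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        data.append((rng.gauss(2.0 * label, 1.0), label))
    return data


candidates = [1, 3, 5, 7]
data = make_toy_data()

# Performance estimation: one test score per outer fold, so the spread
# across folds can be reported alongside the mean.
scores = nested_cv(data, candidates)
mean_acc = statistics.mean(scores)
std_acc = statistics.stdev(scores)
print(f"test accuracy: {mean_acc:.3f} +/- {std_acc:.3f} over {len(scores)} folds")

# Deployment: select the hyperparameter by cross-validation on all the data.
final_k = select_k(data, candidates, n_folds=5)
print(f"hyperparameter for the full-data model: k={final_k}")
```

Because each outer fold yields its own test score, the standard deviation across folds gives the variance estimate that a single fixed test set cannot provide. The outer folds are independent of one another, which is what makes this structure amenable to parallel execution.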