The generalization performance of deep learning models for medical image analysis often degrades on images acquired with different devices, device settings, or patient populations. A better understanding of how well models generalize to new images is crucial for building clinicians' trust in deep learning. Although significant research effort has recently been directed toward establishing generalization bounds and complexity measures, there is often still a significant discrepancy between predicted and actual generalization performance. Moreover, related large-scale empirical studies have primarily been validated on general-purpose image datasets. This paper presents an empirical study of the correlation between 25 complexity measures and the generalization ability of supervised deep learning classifiers for breast ultrasound images. The results indicate that PAC-Bayes flatness-based and path norm-based measures produce the most consistent explanation for the combination of models and data. We also investigate a multi-task classification and segmentation approach for breast images, and report that such an approach acts as an implicit regularizer and is conducive to improved generalization.
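As a rough illustration of what correlating a complexity measure with generalization can look like, the sketch below ranks a handful of trained models by one complexity measure and by their generalization gap, then computes a rank correlation between the two. The abstract does not state which correlation statistic the study uses; Kendall's tau is assumed here purely for illustration, and the function name `kendall_tau` and all numeric values are hypothetical.

```python
from itertools import combinations


def kendall_tau(xs, ys):
    """Kendall's tau-a: (concordant - discordant) / total number of pairs."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    concordant = 0
    discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1  # both quantities order the pair the same way
        elif sign < 0:
            discordant += 1  # the two quantities disagree on the order
        # ties (sign == 0) count toward neither total
    n_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical values, one entry per trained model (not taken from the study):
# a complexity measure and the gap between training and test accuracy.
complexity = [0.8, 1.5, 2.1, 3.0, 3.7]
gen_gap = [0.02, 0.05, 0.04, 0.09, 0.11]

tau = kendall_tau(complexity, gen_gap)
print(f"Kendall tau = {tau:.2f}")
```

A value of tau near 1 would suggest that the measure orders the models the same way their generalization gap does, while a value near 0 would suggest that the measure carries little information about generalization for that set of models.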