Medical segmentation models are evaluated empirically. Because such an evaluation is based on a limited set of example images, it is unavoidably noisy. Beyond a mean performance measure, reporting confidence intervals is thus crucial. However, this is rarely done in medical image segmentation. The width of the confidence interval depends on the test set size and on the spread of the performance measure (its standard deviation across the test set). For classification, many test images are needed to avoid wide confidence intervals. Segmentation, however, has not been studied in this respect, and it differs in the amount of information carried by a given test image. In this paper, we study the typical confidence intervals in medical image segmentation. We carry out experiments on 3D image segmentation using the standard nnU-Net framework, two datasets from the Medical Decathlon challenge, and two performance measures: the Dice accuracy and the Hausdorff distance. We show that the parametric confidence intervals are reasonable approximations of the bootstrap estimates for varying test set sizes and spreads of the performance metric. Importantly, we show that the test set size needed to achieve a given precision is often much lower than for classification tasks. Typically, a 1%-wide confidence interval requires about 100-200 test samples when the spread is low (standard deviation around 3%). More difficult segmentation tasks may lead to higher spreads and require over 1000 samples.
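The dependence of the interval width on test set size and spread can be illustrated with a minimal sketch. It is not the paper's code. It assumes that the parametric interval is the usual normal approximation for the mean (mean ± z·s/√n), that the bootstrap interval is a percentile bootstrap of the mean, and that "width" means the full width of a 95% interval. The function names (`parametric_ci`, `bootstrap_ci`, `required_test_size`) are hypothetical, and the scores in the usage part are synthetic stand-ins for per-image Dice values.

```python
import math
from statistics import NormalDist

import numpy as np


def parametric_ci(scores, level=0.95):
    """Normal-approximation CI for the mean: mean +/- z * s / sqrt(n)."""
    scores = np.asarray(scores, dtype=float)
    z = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = z * scores.std(ddof=1) / math.sqrt(scores.size)
    mean = scores.mean()
    return float(mean - half_width), float(mean + half_width)


def bootstrap_ci(scores, level=0.95, n_resamples=10_000, seed=0):
    """Percentile bootstrap CI for the mean (assumed variant)."""
    scores = np.asarray(scores, dtype=float)
    rng = np.random.default_rng(seed)
    # Resample test cases with replacement, one row per bootstrap replicate.
    idx = rng.integers(0, scores.size, size=(n_resamples, scores.size))
    means = scores[idx].mean(axis=1)
    alpha = 1 - level
    lo, hi = np.quantile(means, [alpha / 2, 1 - alpha / 2])
    return float(lo), float(hi)


def required_test_size(spread, width, level=0.95):
    """Smallest n whose parametric CI has full width <= `width`.

    Solves 2 * z * spread / sqrt(n) <= width for n.
    """
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return math.ceil((2 * z * spread / width) ** 2)


if __name__ == "__main__":
    # Synthetic per-image scores: mean 0.90, spread 0.03 (illustrative only).
    scores = np.random.default_rng(1).normal(0.90, 0.03, size=150)
    print("parametric:", parametric_ci(scores))
    print("bootstrap: ", bootstrap_ci(scores))
    print("n for 1% width at 3% spread:", required_test_size(0.03, 0.01))  # 139
```

Under these assumptions, a spread of 3% and a target width of 1% give n = 139, which falls in the 100-200 range quoted above. Because n grows with the square of the spread, tripling the spread to 9% pushes the requirement past 1000 samples.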