An important issue in medical image processing is to estimate not only the performance of algorithms but also the precision of that performance estimate. Reporting precision typically amounts to reporting the standard error of the mean (SEM) or, equivalently, confidence intervals. However, this is rarely done in medical image segmentation studies. In this paper, we aim to estimate the typical confidence that can be expected in such studies. To that end, we first perform experiments on Dice metric estimation using a standard deep learning model (U-Net) and a classical task from the Medical Segmentation Decathlon. We extensively study precision estimation using both a Gaussian assumption and bootstrapping (which does not require any assumption on the distribution). We then perform simulations for other test set sizes and performance spreads. Overall, our work shows that small test sets lead to wide confidence intervals (e.g. $\sim$8 points of Dice for 20 samples with $\sigma \simeq 10$).
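As a rough illustration of the two precision estimates named above, the following minimal sketch computes a 95% confidence interval for the mean Dice score under a Gaussian assumption and by percentile bootstrap. It is not the code used in the study: the function names, the number of bootstrap resamples, the choice of the percentile bootstrap, and the synthetic scores (drawn from a normal distribution with mean 80 and $\sigma = 10$ on a 0 to 100 Dice scale) are all assumptions made for the example.

```python
import numpy as np


def gaussian_ci_width(sigma, n, z=1.96):
    """Full width of the Gaussian confidence interval of the mean: 2 * z * sigma / sqrt(n)."""
    return 2.0 * z * sigma / np.sqrt(n)


def gaussian_ci(scores, z=1.96):
    """Return (mean, SEM, (low, high)) assuming the sample mean is Gaussian."""
    scores = np.asarray(scores, dtype=float)
    mean = scores.mean()
    # SEM uses the unbiased sample standard deviation (ddof=1).
    sem = scores.std(ddof=1) / np.sqrt(scores.size)
    return mean, sem, (mean - z * sem, mean + z * sem)


def bootstrap_ci(scores, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean (no distributional assumption)."""
    scores = np.asarray(scores, dtype=float)
    rng = np.random.default_rng(seed)
    # Resample the test set with replacement n_boot times.
    idx = rng.integers(0, scores.size, size=(n_boot, scores.size))
    boot_means = scores[idx].mean(axis=1)
    low, high = np.percentile(boot_means, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return low, high


# Hypothetical per-case Dice scores (in points), NOT data from the paper.
rng = np.random.default_rng(42)
dice = np.clip(rng.normal(loc=80.0, scale=10.0, size=20), 0.0, 100.0)

mean, sem, (g_low, g_high) = gaussian_ci(dice)
b_low, b_high = bootstrap_ci(dice)

print(f"mean Dice = {mean:.2f}, SEM = {sem:.2f}")
print(f"Gaussian 95% CI:  [{g_low:.2f}, {g_high:.2f}]")
print(f"Bootstrap 95% CI: [{b_low:.2f}, {b_high:.2f}]")
# Theoretical width for sigma = 10 and n = 20 (about 8.8 points with z = 1.96).
print(f"Expected Gaussian CI width: {gaussian_ci_width(10.0, 20):.2f}")
```

With $z = 1.96$, the closed-form width $2 z \sigma / \sqrt{n}$ gives about 8.8 points for $n = 20$ and $\sigma = 10$, which is in line with the order of magnitude quoted above. The width shrinks only as $1/\sqrt{n}$, so halving it requires roughly four times as many test cases.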