The evaluation of segmentation performance is a common task in biomedical image analysis, with its importance emphasized in recently released metrics selection guidelines and computing frameworks. To quantitatively evaluate the alignment of two segmentations, researchers commonly resort to counting metrics, such as the Dice similarity coefficient, or distance-based metrics, such as the Hausdorff distance. These metrics are usually computed with publicly available open-source tools, under the inherent assumption that the tools provide consistent results. In this study, we questioned this assumption and performed a systematic implementation analysis, along with quantitative experiments on real-world clinical data, to compare 11 open-source tools for distance-based metrics computation against our highly accurate mesh-based reference implementation. The results revealed statistically significant differences among all open-source tools, which is both surprising and concerning, since it calls into question the validity of existing studies. Besides identifying the main sources of variation, we also provide recommendations for distance-based metrics computation.
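To make the two metric families named in the abstract concrete, the following is a minimal Python sketch of a counting metric (Dice similarity coefficient) and a distance-based metric (Hausdorff distance) on a toy 2D example. It is an illustration only, not any of the 11 evaluated tools nor the mesh-based reference implementation. The function names, the `spacing` parameter, the convention for empty masks, and the simplification of measuring distances between all foreground voxel centers (rather than extracted boundaries or meshes) are assumptions of this sketch.

```python
from itertools import product
from math import sqrt


def dice(a, b):
    """Dice similarity coefficient between two sets of voxel indices."""
    if not a and not b:
        return 1.0  # convention assumed for this sketch: two empty masks agree
    return 2.0 * len(a & b) / (len(a) + len(b))


def directed_hausdorff(a, b, spacing):
    """Largest distance from any point of `a` to its nearest point of `b`."""
    def dist(p, q):
        # Euclidean distance in physical units: index offsets scaled by spacing.
        return sqrt(sum(((pi - qi) * s) ** 2 for pi, qi, s in zip(p, q, spacing)))

    return max(min(dist(p, q) for q in b) for p in a)


def hausdorff(a, b, spacing=(1.0, 1.0)):
    """Symmetric Hausdorff distance: the larger of the two directed distances."""
    return max(directed_hausdorff(a, b, spacing),
               directed_hausdorff(b, a, spacing))


# Two 3x3 squares on a 2D grid, shifted by one voxel along the second axis.
seg_a = set(product(range(3), range(3)))
seg_b = set(product(range(3), range(1, 4)))

dice_value = dice(seg_a, seg_b)                                # overlap-based
hd_isotropic = hausdorff(seg_a, seg_b)                         # unit spacing
hd_anisotropic = hausdorff(seg_a, seg_b, spacing=(1.0, 2.5))   # anisotropic grid
```

In this toy case the Dice value does not depend on voxel spacing, whereas the Hausdorff distance changes with the assumed physical spacing, which illustrates why a distance-based metric leaves more implementation choices open than a counting metric.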