Are we using appropriate segmentation metrics? Identifying correlates of human expert perception for CNN training beyond rolling the DICE coefficient

Florian Kofler, Ivan Ezhov, Fabian Isensee, Fabian Balsiger, Christoph Berger, Maximilian Koerner, Beatrice Demiray, Julia Rackerseder, Johannes Paetzold, Hongwei Li, Suprosanna Shit, Richard McKinley, Marie Piraud, Spyridon Bakas, Claus Zimmer, Nassir Navab, Jan Kirschke, Benedikt Wiestler, Bjoern Menze

From arXiv. Accepted for publication at the Journal of Machine Learning for Biomedical Imaging (MELBA): https://melba-journal.org/2023:002

Metrics optimized in complex machine learning tasks are often selected in an ad-hoc manner, and it is unknown how well they align with human expert perception. We explore the correlations between established quantitative segmentation quality metrics and qualitative evaluations by professionally trained human raters. To this end, we conduct psychophysical experiments for two complex biomedical semantic segmentation problems. We discover that current standard metrics and loss functions correlate only moderately with the segmentation quality assessment of experts. Importantly, this effect is particularly pronounced for clinically relevant structures, such as the enhancing tumor compartment of glioma in brain magnetic resonance and grey matter in ultrasound imaging. It is often unclear how to optimize abstract metrics, such as human expert perception, in convolutional neural network (CNN) training. To cope with this challenge, we propose a novel strategy that employs techniques of classical statistics to create complementary compound loss functions that better approximate human expert perception. Across all rating experiments, human experts consistently scored computer-generated segmentations better than the human-curated reference labels. Our results, therefore, strongly question many current practices in medical image segmentation and provide meaningful cues for future research.
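To make the two central ideas of the abstract concrete, the following is a minimal sketch, not the authors' implementation. It shows (a) a soft Dice coefficient, (b) a weighted compound loss that combines Dice with binary cross-entropy, and (c) a Pearson correlation that could be used to compare a metric against expert ratings. The function names, the choice of cross-entropy as the second term, and the equal default weights are assumptions made for illustration; the abstract does not specify which losses are combined or how they are weighted.

```python
import math
from typing import Sequence


def soft_dice(pred: Sequence[float], target: Sequence[int], eps: float = 1e-6) -> float:
    """Soft Dice coefficient between predicted probabilities and a binary mask.

    Both inputs are flattened to 1-D sequences of equal length.
    """
    intersection = sum(p * t for p, t in zip(pred, target))
    denominator = sum(pred) + sum(target)
    # eps keeps the ratio defined (and equal to 1) when both masks are empty.
    return (2.0 * intersection + eps) / (denominator + eps)


def binary_cross_entropy(pred: Sequence[float], target: Sequence[int], eps: float = 1e-7) -> float:
    """Mean binary cross-entropy; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for p, t in zip(pred, target):
        p = min(max(p, eps), 1.0 - eps)
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / len(pred)


def compound_loss(
    pred: Sequence[float],
    target: Sequence[int],
    w_dice: float = 0.5,  # hypothetical weight, not taken from the paper
    w_bce: float = 0.5,   # hypothetical weight, not taken from the paper
) -> float:
    """Weighted sum of a Dice loss term and a cross-entropy term."""
    dice_loss = 1.0 - soft_dice(pred, target)
    return w_dice * dice_loss + w_bce * binary_cross_entropy(pred, target)


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation, e.g. between metric values and expert ratings."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Toy usage on a flattened 2x2 mask (illustrative values only).
reference = [1, 0, 1, 0]
good_prediction = [0.9, 0.1, 0.8, 0.2]
poor_prediction = [0.2, 0.7, 0.3, 0.9]
print(compound_loss(good_prediction, reference))
print(compound_loss(poor_prediction, reference))
```

In such a setup, the weights of the compound loss are the natural place to encode how strongly each term should count; how the paper derives its combinations from the rating experiments is described in the full text, not in this sketch.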
