Non-inferiority of Deep Learning Acute Ischemic Stroke Segmentation on Non-Contrast CT Compared to Expert Neuroradiologists

Sophie Ostmeier, Brian Axelrod, Benjamin F. J. Verhaaren, Soren Christensen, Abdelkader Mahammedi, Yongkai Liu, Benjamin Pulli, Li-Jia Li, Greg Zaharchuk, Jeremy J. Heit

This study aimed to determine whether a convolutional neural network (CNN) deep learning model can segment acute ischemic changes on non-contrast CT (NCCT) as accurately as neuroradiologists. NCCT examinations from 232 acute ischemic stroke patients enrolled in the DEFUSE 3 trial were included. Three experienced neuroradiologists independently segmented the hypodensity that reflected the ischemic core on each scan. The segmentations of the most experienced neuroradiologist (expert A) served as the ground truth for model training, and the segmentations of the two other neuroradiologists (experts B and C) were used for testing. The 232 studies were randomly split into training and test sets, and the training set was further randomly divided into 5 folds, each with a training and a validation set. A 3-dimensional CNN architecture was trained and optimized to predict the segmentations of expert A from NCCT. Model performance was assessed with a set of volume, overlap, and distance metrics, using non-inferiority thresholds of 20%, 3 mL, and 3 mm. The optimized model trained on expert A was compared with test experts B and C. A one-sided Wilcoxon signed-rank test was used to test whether the model-expert agreement was non-inferior to the inter-expert agreement. The final model, trained on expert A, reached a Surface Dice at a tolerance of 5 mm of 0.46 ± 0.09 and a Dice of 0.47 ± 0.13 on the ischemic core segmentation task. Compared with the two test neuroradiologists, the model-expert agreement was non-inferior to the inter-expert agreement (p < 0.05). The CNN delineates the hypodense ischemic core on NCCT in acute ischemic stroke patients with an accuracy comparable to that of neuroradiologists.
