Automated Lesion Segmentation of Stroke MRI Using nnU-Net: A Comprehensive External Validation Across Acute and Chronic Lesions

Accurate and generalisable segmentation of stroke lesions from magnetic resonance imaging (MRI) is essential for advancing clinical research, prognostic modelling, and personalised interventions. Although deep learning has improved automated lesion delineation, many existing models are optimised for narrow imaging contexts and generalise poorly to independent datasets, modalities, and stroke stages. Here, we systematically evaluated stroke lesion segmentation using the nnU-Net framework across multiple heterogeneous, publicly available MRI datasets spanning acute and chronic stroke. Models were trained and tested on diffusion-weighted imaging (DWI), fluid-attenuated inversion recovery (FLAIR), and T1-weighted MRI, and evaluated on independent datasets. Across stroke stages, models showed robust generalisation, with segmentation accuracy approaching reported inter-rater reliability. Performance varied with imaging modality and training data characteristics. In acute stroke, DWI-trained models consistently outperformed FLAIR-based models, with only modest gains from multimodal combinations. In chronic stroke, increasing training set size improved performance, with diminishing returns beyond several hundred cases. Lesion volume was a key determinant of accuracy: smaller lesions were harder to segment, and models trained on restricted volume ranges generalised poorly. MRI image quality further constrained generalisability: models trained on lower-quality scans transferred poorly, whereas those trained on higher-quality data generalised well to noisier images. Discrepancies between predictions and reference masks were often attributable to limitations in manual annotations. Together, these findings show that automated lesion segmentation can approach human-level performance while identifying key factors governing generalisability and informing the development of lesion segmentation tools.
