Performance Deterioration of Deep Learning Models after Clinical Deployment: A Case Study with Auto-segmentation for Definitive Prostate Cancer Radiotherapy


16 November 2023


Biling Wang, Michael Dohopolski, Ti Bai, Junjie Wu, Raquibul Hannan, Neil Desai, Aurelie Garant, Daniel Yang, Dan Nguyen, Mu-Han Lin, Robert Timmerman, Xinlei Wang, Steve Jiang

We evaluated the temporal performance of a deep learning (DL) based artificial intelligence (AI) model for auto-segmentation in prostate radiotherapy, seeking to correlate its efficacy with changes in the clinical landscape. Our study involved 1328 prostate cancer patients who underwent definitive radiotherapy from January 2006 to August 2022 at the University of Texas Southwestern Medical Center. We trained a UNet-based segmentation model on data from 2006 to 2011 and tested it on data from 2012 to 2022 to simulate real-world clinical deployment. We measured model performance using the Dice similarity coefficient (DSC) and visualized trends in contour quality using exponentially weighted moving average (EMA) curves. Additionally, we performed the Wilcoxon rank-sum test to analyze differences in DSC distributions across distinct periods, and multiple linear regression to investigate the impact of various clinical factors. The model exhibited peak performance in the initial phase (2012 to 2014) for segmenting the prostate, rectum, and bladder. However, we observed a notable decline in performance for the prostate and rectum after 2015, while bladder contour quality remained stable. Key factors that affected prostate contour quality included physician contouring styles, the use of various hydrogel spacers, CT scan slice thickness, MRI-guided contouring, and the use of intravenous (IV) contrast. Rectum contour quality was influenced by slice thickness, physician contouring styles, and the use of various hydrogel spacers. Bladder contour quality was primarily affected by the use of IV contrast. This study highlights the challenges of maintaining consistent AI model performance in a dynamic clinical setting. It underscores the need for continuous monitoring and updating of AI models to ensure their ongoing effectiveness and relevance in patient care.
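As a rough illustration of the two monitoring quantities named in the abstract, the sketch below computes a Dice similarity coefficient between two binary masks and smooths a chronological series of DSC values with an exponentially weighted moving average. It is not the authors' code: the function names, the flattened-list mask representation, the handling of two empty masks, the seeding of the EMA with the first value, and the smoothing factor `alpha` are all assumptions made for this example, and the abstract does not state which settings the study used.

```python
from typing import Iterable, List, Sequence


def dice_coefficient(pred: Sequence[int], ref: Sequence[int]) -> float:
    """Dice similarity coefficient between two flattened binary masks.

    DSC = 2 * |A intersect B| / (|A| + |B|), ranging from 0 (no overlap) to 1.
    """
    if len(pred) != len(ref):
        raise ValueError("masks must have the same number of voxels")
    pred_sum = sum(1 for p in pred if p)
    ref_sum = sum(1 for r in ref if r)
    if pred_sum + ref_sum == 0:
        # Assumption for this sketch: two empty masks count as perfect agreement.
        return 1.0
    overlap = sum(1 for p, r in zip(pred, ref) if p and r)
    return 2.0 * overlap / (pred_sum + ref_sum)


def ema(values: Iterable[float], alpha: float = 0.1) -> List[float]:
    """Exponentially weighted moving average of a chronological series.

    Seeded with the first observation (an assumption); alpha is hypothetical.
    """
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    smoothed: List[float] = []
    for v in values:
        if not smoothed:
            smoothed.append(float(v))
        else:
            smoothed.append(alpha * v + (1.0 - alpha) * smoothed[-1])
    return smoothed


if __name__ == "__main__":
    # Toy masks: model contour vs. clinical contour, flattened to 0/1 voxels.
    model_mask = [1, 1, 0, 0]
    clinical_mask = [1, 0, 0, 0]
    print(dice_coefficient(model_mask, clinical_mask))

    # Toy per-patient DSC values ordered by treatment date.
    print(ema([0.90, 0.88, 0.70, 0.72], alpha=0.5))
```

In a deployment-monitoring setting, the per-patient DSC values would be ordered by treatment date before smoothing, so that a sustained drop in the EMA curve points to a shift in clinical practice rather than a single outlier case.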
