To safely deploy deep learning models in the clinic, a quality assurance (QA) framework is needed for routine or continuous monitoring of input-domain shift and of model performance in the absence of ground-truth contours. In this work, cardiac substructure segmentation was used as an example task to establish such a QA framework. A benchmark dataset consisting of computed tomography (CT) images and manual cardiac delineations from 241 patients was collected, covering one 'common' image domain and five 'uncommon' domains. Segmentation models were tested on the benchmark dataset for an initial evaluation of model capacity and limitations. An image domain shift detector was developed using a trained denoising autoencoder (DAE) and two hand-engineered features. A variational autoencoder (VAE) was also trained to estimate the shape quality of the auto-segmentation results. Using the features extracted from each image/segmentation pair as inputs, a regression model was trained to predict the per-patient segmentation accuracy, measured by the Dice similarity coefficient (DSC). The framework was tested across 19 segmentation models to evaluate its generalizability. The DSC predicted by the regression models achieved a mean absolute error (MAE) ranging from 0.036 to 0.046, with an average MAE of 0.041. When tested on the benchmark dataset, no segmentation model's performance was significantly affected by the scanning parameters: field of view (FOV), slice thickness, and reconstruction kernel. For input images with Poisson noise, CNN-based segmentation models showed a decrease in DSC of 0.07 to 0.41, while the transformer-based model was not significantly affected.
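The two metrics named in the abstract, the Dice similarity coefficient (DSC) and the mean absolute error (MAE) between predicted and actual DSC, can be illustrated with a minimal sketch. The function names `dice_coefficient` and `mean_absolute_error`, the empty-mask convention, and the toy arrays below are assumptions made for illustration; they are not taken from the study's code or data.

```python
import numpy as np


def dice_coefficient(pred_mask, true_mask):
    """Dice similarity coefficient between two binary masks (hypothetical helper)."""
    pred = np.asarray(pred_mask, dtype=bool)
    true = np.asarray(true_mask, dtype=bool)
    denom = pred.sum() + true.sum()
    if denom == 0:
        # Assumed convention: two empty masks are treated as perfect agreement.
        return 1.0
    overlap = np.logical_and(pred, true).sum()
    return float(2.0 * overlap / denom)


def mean_absolute_error(predicted, actual):
    """Mean absolute difference between predicted and actual values (hypothetical helper)."""
    predicted = np.asarray(predicted, dtype=float)
    actual = np.asarray(actual, dtype=float)
    return float(np.mean(np.abs(predicted - actual)))


# Toy 2D masks standing in for an auto-segmentation and a manual contour.
auto_mask = np.array([[1, 1, 0],
                      [0, 1, 0]])
manual_mask = np.array([[1, 0, 0],
                        [0, 1, 1]])
toy_dsc = dice_coefficient(auto_mask, manual_mask)  # 2 * 2 / (3 + 3)

# Made-up per-patient DSC values: regression-model predictions vs. measured DSC.
toy_mae = mean_absolute_error([0.80, 0.90], [0.85, 0.86])
```

In the setting the abstract describes, the actual DSC is only available where manual contours exist, so an MAE of this kind can be computed on a labeled benchmark but not during routine clinical monitoring.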