Need for Objective Task-based Evaluation of Deep Learning-Based Denoising Methods: A Study in the Context of Myocardial Perfusion SPECT

Artificial intelligence-based methods have generated substantial interest in nuclear medicine. An area of significant interest has been the use of deep-learning (DL)-based approaches for denoising images acquired with lower doses, shorter acquisition times, or both. Objective evaluation of these approaches is essential for clinical application. DL-based approaches for denoising nuclear-medicine images have typically been evaluated using fidelity-based figures of merit (FoMs) such as root mean squared error (RMSE) and the structural similarity index measure (SSIM). However, these images are acquired for clinical tasks and thus should be evaluated based on their performance in these tasks. Our objectives were to (1) investigate whether evaluation with these FoMs is consistent with objective clinical-task-based evaluation; (2) provide a theoretical analysis for determining the impact of denoising on signal-detection tasks; and (3) demonstrate the utility of virtual clinical trials (VCTs) to evaluate DL-based methods. A VCT to evaluate a DL-based method for denoising myocardial perfusion SPECT (MPS) images was conducted. The impact of DL-based denoising was evaluated using fidelity-based FoMs and the area under the receiver operating characteristic (ROC) curve (AUC), which quantified performance on detecting perfusion defects in MPS images as obtained using a model observer with anthropomorphic channels. Based on fidelity-based FoMs, denoising using the considered DL-based method led to significantly superior performance. However, based on ROC analysis, denoising did not improve, and in fact often degraded, detection-task performance. The results motivate the need for objective task-based evaluation of DL-based denoising approaches. Further, this study shows how VCTs provide a mechanism to conduct such evaluations. Finally, our theoretical treatment reveals insights into the reasons for the limited performance of the denoising approach.

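Translation: Artificial intelligence-based methods have attracted broad attention in nuclear medicine. Within this field, using deep learning to denoise images acquired at low dose, with short acquisition times, or both has become an important research direction. Objective evaluation of these methods is essential for clinical application. At present, deep-learning methods for denoising nuclear-medicine images are usually evaluated with fidelity-based metrics such as RMSE and SSIM. However, these images are acquired for clinical tasks and should therefore be evaluated by their performance on those tasks. This study aims to: (1) examine whether fidelity-based metrics agree with objective clinical-task-based evaluation; (2) analyze theoretically the effect of denoising on signal-detection tasks; and (3) demonstrate the practical value of virtual clinical trials (VCTs) for evaluating deep-learning methods. We used a VCT to evaluate a deep-learning method for denoising myocardial perfusion SPECT (MPS) images. The effect of deep-learning denoising was quantified using fidelity-based metrics and AUC, which measured performance in detecting myocardial perfusion defects in MPS images using a model observer with anthropomorphic channels. According to the fidelity-based metrics, the deep-learning method significantly improved performance. ROC analysis, however, showed that denoising did not improve detection performance and often degraded it. This result highlights the need for objective task-based evaluation of deep-learning denoising methods. In addition, the study shows how VCTs provide a mechanism for carrying out such evaluations. Finally, the theoretical analysis reveals the underlying reasons for the limited performance of the denoising method.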
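
The contrast the abstract draws between fidelity-based FoMs and task-based evaluation can be illustrated with a toy simulation. The sketch below is not the study's method: it assumes a flat background with white Gaussian noise instead of MPS images, a simple mean filter (`box_blur`) as a stand-in for the DL-based denoiser, and a fixed-template linear observer instead of a model observer with anthropomorphic channels. All function names and parameter values are hypothetical and chosen only for illustration.

```python
import numpy as np


def box_blur(images, radius=2):
    """Circular mean filter over the last two axes (stand-in 'denoiser', not a DL model)."""
    out = np.zeros_like(images, dtype=float)
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            out += np.roll(images, (dy, dx), axis=(-2, -1))
    return out / (2 * radius + 1) ** 2


def rmse(images, truth):
    """Fidelity-based FoM: root mean squared error against the noise-free image."""
    return float(np.sqrt(np.mean((images - truth) ** 2)))


def observer_auc(absent, present, template):
    """Task-based FoM: AUC of a linear template observer (Mann-Whitney estimate)."""
    t_absent = (absent * template).sum(axis=(1, 2))
    t_present = (present * template).sum(axis=(1, 2))
    diff = t_present[:, None] - t_absent[None, :]
    # Fraction of (present, absent) pairs ranked correctly; ties count as one half.
    return float(np.mean(diff > 0) + 0.5 * np.mean(diff == 0))


def run_demo(n=1000, size=32, sigma=1.0, seed=0):
    rng = np.random.default_rng(seed)

    # Assumed toy phantom: flat background with a small 2x2 low-contrast defect.
    defect = np.zeros((size, size))
    c = size // 2
    defect[c - 1:c + 1, c - 1:c + 1] = 1.0
    truth_absent = np.full((size, size), 10.0)
    truth_present = truth_absent - defect

    # Noisy defect-absent and defect-present image ensembles.
    absent = truth_absent + sigma * rng.standard_normal((n, size, size))
    present = truth_present + sigma * rng.standard_normal((n, size, size))

    # "Denoise" both ensembles with the stand-in filter.
    absent_dn, present_dn = box_blur(absent), box_blur(present)

    # Observer template: expected present-minus-absent difference in each image domain.
    template = truth_present - truth_absent
    return {
        "rmse_noisy": rmse(present, truth_present),
        "rmse_denoised": rmse(present_dn, truth_present),
        "auc_noisy": observer_auc(absent, present, template),
        "auc_denoised": observer_auc(absent_dn, present_dn, box_blur(template)),
    }


if __name__ == "__main__":
    for name, value in run_demo().items():
        print(f"{name}: {value:.3f}")
```

Under these assumptions, RMSE falls after filtering while the AUC does not rise, because the filter attenuates the small defect along with the noise. This is only a linear, white-noise analogue of the behavior reported in the abstract and says nothing about how any particular DL-based denoiser behaves.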