Need for Objective Task-based Evaluation of Deep Learning-Based Denoising Methods: A Study in the Context of Myocardial Perfusion SPECT

Artificial intelligence-based methods have generated substantial interest in nuclear medicine. An area of significant interest has been the use of deep-learning (DL)-based approaches for denoising images acquired with lower doses, shorter acquisition times, or both. Objective evaluation of these approaches is essential for clinical application. DL-based approaches for denoising nuclear-medicine images have typically been evaluated using fidelity-based figures of merit (FoMs) such as root mean squared error (RMSE) and the structural similarity index measure (SSIM). However, these images are acquired for clinical tasks and thus should be evaluated based on their performance in these tasks. Our objectives were to (1) investigate whether evaluation with these FoMs is consistent with objective clinical-task-based evaluation; (2) provide a theoretical analysis for determining the impact of denoising on signal-detection tasks; and (3) demonstrate the utility of virtual clinical trials (VCTs) for evaluating DL-based methods. A VCT to evaluate a DL-based method for denoising myocardial perfusion SPECT (MPS) images was conducted. The impact of DL-based denoising was evaluated using fidelity-based FoMs and the area under the receiver operating characteristic (ROC) curve (AUC). The AUC quantified performance on detecting perfusion defects in MPS images and was obtained using a model observer with anthropomorphic channels. Based on fidelity-based FoMs, denoising using the considered DL-based method led to significantly superior performance. However, based on ROC analysis, denoising did not improve, and in fact often degraded, detection-task performance. The results motivate the need for objective task-based evaluation of DL-based denoising approaches. Further, this study shows how VCTs provide a mechanism to conduct such evaluations. Finally, our theoretical treatment reveals insights into the reasons for the limited performance of the denoising approach.
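To make the distinction between the two kinds of evaluation concrete, the following is a minimal sketch, not the implementation used in this study. It computes a fidelity-based FoM (RMSE) and a task-based FoM (AUC from a channelized Hotelling observer, estimated with the Wilcoxon-Mann-Whitney statistic). The function names, the array layout, and the `channels` matrix are all hypothetical; in particular, the anthropomorphic channels used in the study are not specified here and would have to be supplied by the reader.

```python
import numpy as np


def rmse(reference, image):
    """Fidelity-based FoM: root mean squared error against a reference image."""
    reference = np.asarray(reference, dtype=float)
    image = np.asarray(image, dtype=float)
    return float(np.sqrt(np.mean((reference - image) ** 2)))


def auc_from_scores(absent_scores, present_scores):
    """Wilcoxon-Mann-Whitney estimate of the AUC; ties count as one half."""
    absent = np.asarray(absent_scores, dtype=float)[None, :]
    present = np.asarray(present_scores, dtype=float)[:, None]
    return float(np.mean((present > absent) + 0.5 * (present == absent)))


def cho_scores(train_absent, train_present, test_images, channels):
    """Test statistics from a channelized Hotelling observer (sketch).

    Images are rows of shape (n_images, n_pixels). `channels` is a
    hypothetical (n_pixels, n_channels) matrix chosen by the caller.
    """
    # Project the training images onto the channels.
    v_absent = np.asarray(train_absent, dtype=float) @ channels
    v_present = np.asarray(train_present, dtype=float) @ channels

    # Mean difference between classes and average intra-class covariance.
    delta = v_present.mean(axis=0) - v_absent.mean(axis=0)
    cov = 0.5 * (
        np.atleast_2d(np.cov(v_absent, rowvar=False))
        + np.atleast_2d(np.cov(v_present, rowvar=False))
    )

    # Hotelling template in channel space, applied to the test images.
    template = np.linalg.solve(cov, delta)
    return (np.asarray(test_images, dtype=float) @ channels) @ template
```

Under these assumptions, the AUC for a set of images would be obtained by calling `cho_scores` once on the defect-absent test images and once on the defect-present test images, then passing the two score vectors to `auc_from_scores`. Repeating this for the low-dose and the denoised images gives a task-based comparison that can be set alongside the RMSE values.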
