Image-to-image translation can have a large impact in medical imaging, for instance by synthetically transforming images to other modalities, sequence types, higher resolutions or lower noise levels. To ensure a high level of patient safety, these methods are mostly validated by human reader studies, which require considerable time and cost. Quantitative metrics have been used to complement such studies and to provide a reproducible and objective assessment of synthetic images. Although the SSIM and PSNR metrics are used extensively, they do not detect all types of errors in synthetic images as desired, and other metrics could provide additional useful evaluation. In this study, we give an overview and a quantitative analysis of 15 metrics for assessing the quality of synthetically generated images. We include 11 full-reference metrics (SSIM, MS-SSIM, CW-SSIM, PSNR, MSE, NMSE, MAE, LPIPS, DISTS, NMI and PCC), three non-reference metrics (BLUR, MLC, MSLC) and one downstream-task segmentation metric (DICE), and evaluate their ability to detect 11 kinds of typical distortions and artifacts that occur in MR images. In addition, we analyze the influence of four prominent normalization methods (Minmax, cMinmax, Zscore and Quantile) on the different metrics and distortions. Finally, we provide adverse examples to highlight pitfalls in metric assessment and derive recommendations for the effective use of the analyzed similarity metrics when evaluating image-to-image translation models.
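To make the interplay between normalization and metric values concrete, the following minimal sketch computes MSE, MAE and PSNR on a synthetic image pair after two of the named normalization methods. It is an illustration under stated assumptions, not the study's implementation: the Minmax and Zscore formulas are the commonly used textbook definitions, the "MR slice" is random data, the additive Gaussian noise is a hypothetical distortion, and all function names are invented for this example. cMinmax and Quantile are omitted because their exact definitions are not given here.

```python
import numpy as np


def minmax(img):
    """Rescale intensities to [0, 1] (common Minmax definition; an assumption)."""
    lo, hi = img.min(), img.max()
    return (img - lo) / (hi - lo)


def zscore(img):
    """Shift to zero mean and scale to unit standard deviation (assumed definition)."""
    return (img - img.mean()) / img.std()


def mse(ref, img):
    """Mean squared error between a reference and a test image."""
    return float(np.mean((ref - img) ** 2))


def mae(ref, img):
    """Mean absolute error between a reference and a test image."""
    return float(np.mean(np.abs(ref - img)))


def psnr(ref, img, data_range):
    """Peak signal-to-noise ratio in dB for a given intensity range."""
    return float(10.0 * np.log10(data_range ** 2 / mse(ref, img)))


# Hypothetical stand-in for an MR slice and a noisy synthetic counterpart.
rng = np.random.default_rng(0)
reference = rng.uniform(0.0, 1000.0, size=(64, 64))
distorted = reference + rng.normal(0.0, 20.0, size=reference.shape)

scores = {}
for name, normalize in {"Minmax": minmax, "Zscore": zscore}.items():
    # Each image is normalized with its own statistics before comparison.
    ref_n, img_n = normalize(reference), normalize(distorted)
    # The data range is taken from the normalized reference image.
    data_range = float(ref_n.max() - ref_n.min())
    scores[name] = {
        "MSE": mse(ref_n, img_n),
        "MAE": mae(ref_n, img_n),
        "PSNR": psnr(ref_n, img_n, data_range),
    }
    print(name, scores[name])
```

The same image pair yields different error values under the two normalizations, which is why a reported metric value is hard to interpret unless the normalization and the data range used for PSNR are stated alongside it.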