Similarity Metrics for MR Image-To-Image Translation

Image-to-image translation can have a large impact in medical imaging, for example when images of a patient can be translated to another modality, type or sequence for better diagnosis. However, these methods must be validated by human reader studies, which are costly and restricted to small samples. Automatic evaluation on large samples is therefore needed to pre-evaluate and continuously improve methods before human validation. In this study, we give an overview of reference and non-reference metrics for image synthesis assessment and investigate the ability of nine reference metrics (SSIM, MS-SSIM, PSNR, MSE, NMSE, MAE, LPIPS, NMI and PCC) and three non-reference metrics (BLUR, MSN, MNG) to detect 11 kinds of distortions in MR images from the BraSyn dataset. In addition, we test a downstream segmentation metric and the effect of three normalization methods (Minmax, cMinMax and Zscore). Although PSNR and SSIM are frequently used to evaluate generative models for image-to-image translation tasks in the medical domain, they show very specific shortcomings. SSIM ignores blurring but is very sensitive to intensity shifts in unnormalized MR images. PSNR is even more sensitive to the choice of normalization method and hardly reflects the degree of distortion. Further metrics, such as LPIPS, NMI and DICE, can be very useful for evaluating other aspects of similarity. If the images to be compared are misaligned, most metrics give flawed results. By carefully selecting and reasonably combining image similarity metrics, the training and selection of generative models for MR image synthesis can be improved. Many aspects of their output can then be validated before the final and costly evaluation by trained radiologists is conducted.
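As a rough illustration of why normalization matters for reference metrics, the following minimal sketch compares an image with an intensity-scaled and shifted copy of itself, before and after normalization. It is not the study's implementation: the random array is a synthetic stand-in for an MR slice, the scale factor, offset and noise level are hypothetical, and the `minmax` and `zscore` helpers follow the common textbook definitions, which may differ in detail from the ones used in the study. cMinMax is omitted because the abstract does not define it.

```python
import numpy as np

def minmax(img):
    """Rescale intensities linearly to [0, 1] (common definition, assumed here)."""
    lo, hi = img.min(), img.max()
    return (img - lo) / (hi - lo)

def zscore(img):
    """Shift to zero mean and scale to unit standard deviation (assumed definition)."""
    return (img - img.mean()) / img.std()

def mse(ref, img):
    """Mean squared error over all pixels."""
    return float(np.mean((ref - img) ** 2))

def mae(ref, img):
    """Mean absolute error over all pixels."""
    return float(np.mean(np.abs(ref - img)))

def psnr(ref, img, data_range):
    """Peak signal-to-noise ratio in dB for a given intensity range."""
    return float(10.0 * np.log10(data_range ** 2 / mse(ref, img)))

def pcc(ref, img):
    """Pearson correlation coefficient over all pixels."""
    return float(np.corrcoef(ref.ravel(), img.ravel())[0, 1])

rng = np.random.default_rng(0)
# Synthetic stand-in for an unnormalized MR slice (arbitrary intensity units).
ref = rng.uniform(0.0, 1000.0, size=(64, 64))
# Hypothetical global intensity scaling and shift; the structure is unchanged.
shifted = 1.2 * ref + 50.0
# Hypothetical additive Gaussian noise.
noisy = ref + rng.normal(0.0, 20.0, size=ref.shape)

raw_mse = mse(ref, shifted)                          # large: raw intensities differ
raw_mae = mae(ref, shifted)
minmax_mse = mse(minmax(ref), minmax(shifted))       # ~0: Minmax removes the affine change
zscore_mse = mse(zscore(ref), zscore(shifted))       # ~0: Zscore removes it as well
corr = pcc(ref, shifted)                             # ~1: PCC is invariant to affine changes
noisy_psnr = psnr(ref, noisy, ref.max() - ref.min())

print(f"MSE raw={raw_mse:.1f}  Minmax={minmax_mse:.2e}  Zscore={zscore_mse:.2e}")
print(f"MAE raw={raw_mae:.1f}  PCC={corr:.6f}  PSNR(noisy)={noisy_psnr:.2f} dB")
```

In this toy setting, error-based metrics computed on raw intensities report a large difference for a purely global intensity change, while the same metrics on normalized images and the correlation-based PCC report almost none. Which behaviour is desirable depends on whether absolute intensities carry meaning for the task at hand.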
