Image quality assessment (IQA) is indispensable not only in clinical practice, where it ensures high standards, but also in the development of novel algorithms that operate on medical images with reference data. This paper provides a structured and comprehensive collection of examples in which the two most common full-reference (FR) image quality measures, PSNR and SSIM, prove unsuitable for assessing novel algorithms on different kinds of medical images, including real-world MRI, CT, OCT, X-ray, digital pathology, and photoacoustic imaging data. PSNR and SSIM are well established and have been tested successfully in many natural imaging tasks, but discrepancies in medical scenarios have been noted in the literature. Such inconsistencies are not surprising: medical images have very different properties from natural images, and these properties were neither targeted nor tested when the measures were developed. As a result, PSNR and SSIM may lead to wrong judgements of novel methods for medical images. Improvement is therefore urgently needed, particularly in the current era of AI, to increase explainability, reproducibility, and generalizability in machine learning for medical imaging and beyond. In addition to the pitfalls, we provide ideas for future research and suggest guidelines for the use of FR-IQA measures on medical images.
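To make the PSNR measure named in the abstract concrete, the following is a minimal sketch of its standard definition, PSNR = 10 log10(L^2 / MSE), where L is the assumed data range. It is an illustration only and is not taken from the paper: the function name `psnr`, the synthetic 64x64 image, the noise level, and the nominal range of 4095 (a 12-bit range) are all assumptions chosen for the example. The sketch shows one property that follows directly from the definition: the same pair of images receives a different score depending on which data range is assumed, which matters for images that do not use a fixed 8-bit range.

```python
import numpy as np


def psnr(reference, test, data_range):
    """Peak signal-to-noise ratio in dB for a given assumed data range."""
    reference = np.asarray(reference, dtype=np.float64)
    test = np.asarray(test, dtype=np.float64)
    mse = np.mean((reference - test) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(data_range ** 2 / mse)


# Synthetic stand-in for a reference image (hypothetical data, not from the paper).
rng = np.random.default_rng(0)
reference = rng.integers(0, 1000, size=(64, 64)).astype(np.float64)
# Degraded version: additive Gaussian noise with standard deviation 5.
test = reference + rng.normal(0.0, 5.0, size=reference.shape)

# Same image pair, two different assumptions about the data range.
actual_range = reference.max() - reference.min()
psnr_actual = psnr(reference, test, data_range=actual_range)
psnr_nominal = psnr(reference, test, data_range=4095.0)

print(f"PSNR with actual range:  {psnr_actual:.2f} dB")
print(f"PSNR with nominal range: {psnr_nominal:.2f} dB")
```

The two scores differ by exactly 20 log10(4095 / actual_range) dB even though the images are unchanged, so a reported PSNR value is only interpretable together with the data range used to compute it.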