A residual dense vision transformer for medical image super-resolution with segmentation-based perceptual loss fine-tuning

Super-resolution plays an essential role in medical imaging because it offers a way to achieve high spatial resolution and image quality at no extra acquisition cost. In the past few decades, the rapid development of deep neural networks has advanced super-resolution performance through novel network architectures, loss functions and evaluation metrics. In particular, vision transformers dominate a broad range of computer vision tasks, but challenges remain when applying them to low-level medical image processing tasks. This paper proposes a vision transformer with residual dense connections and local feature fusion for efficient single-image super-resolution (SISR) of medical modalities. Moreover, we implement a general-purpose perceptual loss that incorporates prior knowledge from medical image segmentation, giving manual control over which aspects of image quality are improved. Compared with state-of-the-art (SOTA) methods on four public medical image datasets, the proposed method achieves the best PSNR scores on six of seven modalities. It yields an average improvement of $+0.09$ dB PSNR with only 38\% of the parameters of SwinIR. In addition, the segmentation-based perceptual loss improves PSNR by $+0.14$ dB on average when applied to SOTA methods, including CNNs and vision transformers. We also conduct comprehensive ablation studies to examine potential reasons for the superior performance of vision transformers over CNNs and the impact of individual network and loss function components. The code will be released on GitHub upon publication of the paper.
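To put the reported PSNR gains in perspective, the following minimal sketch uses the standard definition $\mathrm{PSNR} = 10 \log_{10}(\mathrm{MAX}^2 / \mathrm{MSE})$ to convert a gain in dB into the corresponding relative change in mean squared error. The function names `psnr` and `mse_ratio_for_gain` are hypothetical and chosen only for illustration, and the peak intensity of 1.0 is an assumption; the abstract does not state the evaluation protocol used in the paper.

```python
import math


def psnr(mse: float, max_value: float = 1.0) -> float:
    """PSNR in dB for a given mean squared error and peak intensity.

    max_value=1.0 assumes intensities normalized to [0, 1] (an assumption,
    not a detail taken from the paper).
    """
    return 10.0 * math.log10(max_value ** 2 / mse)


def mse_ratio_for_gain(gain_db: float) -> float:
    """Factor by which the MSE is multiplied when PSNR rises by gain_db."""
    return 10.0 ** (-gain_db / 10.0)


# The two average gains quoted in the abstract.
for gain in (0.09, 0.14):
    ratio = mse_ratio_for_gain(gain)
    print(f"+{gain:.2f} dB PSNR -> MSE x {ratio:.4f} "
          f"({(1.0 - ratio) * 100:.1f}% lower)")
```

Under this definition, gains of $+0.09$ dB and $+0.14$ dB correspond to reductions in MSE of roughly 2\% and 3\%, respectively. Because the relation depends only on the ratio of the two MSE values, the result does not change with the assumed peak intensity.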
