Because medical images are scarce and have specific imaging characteristics, making Vision Transformers (ViTs) lightweight for efficient medical image segmentation is a significant challenge, and one that current studies have yet to address. This work revisits the relationship between CNNs and Transformers in lightweight universal networks for medical image segmentation, aiming to combine the advantages of both at the level of basic architecture design. To leverage the inductive bias inherent in CNNs, we abstract a Transformer-like lightweight CNN block (ConvUtr) to serve as the patch embedding of ViTs, feeding the Transformer denoised, non-redundant, and highly condensed semantic information. Moreover, we introduce an adaptive Local-Global-Local (LGL) block to enable efficient local-to-global information exchange, maximizing the Transformer's ability to extract global context. Finally, we build an efficient medical image segmentation model (MobileUtr) based on CNNs and Transformers. Extensive experiments on five public medical image datasets spanning three modalities show that MobileUtr outperforms state-of-the-art methods while being lighter and requiring less computation. Code is available at https://github.com/FengheTan9/MobileUtr.
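As a rough illustration of why a convolutional block can stand in for the patch embedding of a ViT, the sketch below compares the token grid produced by a stack of strided convolutions with the grid produced by a single patchify convolution, and counts the token pairs that full self-attention would score. The abstract does not specify the layers of ConvUtr, so the stem configuration (four 3x3 convolutions with stride 2), the 256x256 input size, and all function names here are assumptions chosen for illustration only, not the authors' design.

```python
def conv_out_size(size, kernel, stride, padding):
    """Output length of a convolution along one spatial axis."""
    return (size + 2 * padding - kernel) // stride + 1


def token_grid(height, width, stages):
    """Apply a stack of (kernel, stride, padding) conv stages; return the final grid."""
    for kernel, stride, padding in stages:
        height = conv_out_size(height, kernel, stride, padding)
        width = conv_out_size(width, kernel, stride, padding)
    return height, width


def attention_pair_count(num_tokens):
    """Number of token pairs scored by full self-attention (grows as N^2)."""
    return num_tokens * num_tokens


# Hypothetical convolutional stem: four 3x3 convs, stride 2, padding 1 (overall stride 16).
conv_stem = [(3, 2, 1)] * 4
# Plain ViT-style patch embedding: one 16x16 convolution with stride 16.
vit_patchify = [(16, 16, 0)]
# A shallower hypothetical stem (overall stride 8) for comparison.
shallow_stem = [(3, 2, 1)] * 3

for name, stages in [
    ("conv stem (stride 16)", conv_stem),
    ("16x16 patchify", vit_patchify),
    ("conv stem (stride 8)", shallow_stem),
]:
    h, w = token_grid(256, 256, stages)  # 256x256 input is an assumed size
    print(name, "grid:", (h, w), "tokens:", h * w,
          "attention pairs:", attention_pair_count(h * w))
```

Under these assumptions, the stride-16 convolutional stem yields the same 16x16 token grid as plain patchification, so it can replace the patch embedding without lengthening the sequence, while the stride-8 variant quadruples the token count and multiplies the number of attention pairs by sixteen. This is one way to read the abstract's emphasis on feeding the Transformer highly condensed information.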