Efficient Scopeformer: Towards Scalable and Rich Feature Extraction for Intracranial Hemorrhage Detection

The quality and richness of the feature maps extracted by convolutional neural networks (CNNs) and vision Transformers (ViTs) directly affect the robustness of model performance. In medical computer vision, such information-rich features are crucial for detecting rare cases within large datasets. This work presents the "Scopeformer," a novel multi-CNN-ViT model for intracranial hemorrhage classification in computed tomography (CT) images. The Scopeformer architecture is scalable and modular, which allows various CNN architectures to be used as the backbone, with diversified output features and pre-training strategies. We propose effective feature projection methods to reduce redundancy among CNN-generated features and to control the input size of the ViTs. Extensive experiments with various Scopeformer models show that model performance is proportional to the number of convolutional blocks employed in the feature extractor. Using multiple strategies, including diversified pre-training paradigms for the CNNs, different pre-training datasets, and style transfer techniques, we demonstrate an overall improvement in model performance at various computational budgets. We then propose smaller, compute-efficient Scopeformer versions with three different types of input and output ViT configurations. Efficient Scopeformers use four different pre-trained CNN architectures as feature extractors to increase feature richness. Our best Efficient Scopeformer model achieved an accuracy of 96.94% and a weighted logarithmic loss of 0.083 with an eightfold reduction in the number of trainable parameters compared to the base Scopeformer. Another version of the Efficient Scopeformer model further reduced the parameter space by almost 17 times with a negligible reduction in performance. Hybrid CNNs and ViTs might provide the desired feature richness for developing accurate medical computer vision models.
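The abstract reports a weighted logarithmic loss without defining it. As a rough illustration only, the sketch below computes a weighted multi-label log loss of the kind commonly used for this sort of classification task: a per-class binary cross-entropy, weighted per class and averaged over samples. The function name `weighted_log_loss`, the clipping constant `eps`, and the example class weights are assumptions made for illustration and are not taken from the paper.

```python
import math


def weighted_log_loss(y_true, y_prob, class_weights, eps=1e-15):
    """Weighted multi-label log loss (hypothetical helper, not the paper's code).

    y_true:        list of samples, each a list of 0/1 labels per class
    y_prob:        list of samples, each a list of predicted probabilities
    class_weights: one non-negative weight per class (assumed values)
    """
    total_weight = sum(class_weights)
    loss_sum = 0.0
    for labels, probs in zip(y_true, y_prob):
        sample_loss = 0.0
        for y, p, w in zip(labels, probs, class_weights):
            # Clip probabilities so log() never receives 0.
            p = min(max(p, eps), 1.0 - eps)
            sample_loss += -w * (y * math.log(p) + (1 - y) * math.log(1.0 - p))
        # Normalise by the total class weight, then average over samples below.
        loss_sum += sample_loss / total_weight
    return loss_sum / len(y_true)


# Two classes; the first is weighted twice as heavily (assumed weights).
example = weighted_log_loss([[1, 0]], [[0.9, 0.5]], [2.0, 1.0])
```

With this definition, a model that predicts 0.5 for every class scores ln 2 regardless of the weights, so lower values indicate better calibrated and more accurate predictions.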

