The quality and richness of the feature maps extracted by convolutional neural networks (CNNs) and vision Transformers (ViTs) directly affect the robustness of model performance. In medical computer vision, these information-rich features are crucial for detecting rare cases within large datasets. This work presents the "Scopeformer," a novel multi-CNN-ViT model for intracranial hemorrhage classification in computed tomography (CT) images. The Scopeformer architecture is scalable and modular, which allows various CNN architectures with diversified output features and pre-training strategies to serve as the backbone. We propose effective feature projection methods to reduce redundancy among CNN-generated features and to control the input size of the ViTs. Extensive experiments with various Scopeformer models show that model performance improves with the number of convolutional blocks employed in the feature extractor. Using multiple strategies, including diversified pre-training paradigms for the CNNs, different pre-training datasets, and style transfer techniques, we demonstrate an overall improvement in model performance at various computational budgets. We then propose smaller, compute-efficient Scopeformer versions with three different types of input and output ViT configurations. These Efficient Scopeformers use four different pre-trained CNN architectures as feature extractors to increase feature richness. Our best Efficient Scopeformer model achieved an accuracy of 96.94% and a weighted logarithmic loss of 0.083 with an eightfold reduction in the number of trainable parameters compared to the base Scopeformer. Another version of the Efficient Scopeformer further reduced the parameter count by a factor of almost 17 with negligible loss in performance. Hybrid CNNs and ViTs might provide the desired feature richness for developing accurate medical computer vision models.
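
The abstract reports a weighted logarithmic loss but does not define it. As a minimal sketch, assuming the metric is a per-label binary cross-entropy averaged across labels with fixed label weights and then across samples, it could be computed as below. The function name `weighted_log_loss`, the two-label layout, and the weights `[2.0, 1.0]` are hypothetical choices for illustration and are not taken from the paper.

```python
import math


def weighted_log_loss(y_true, y_prob, weights, eps=1e-15):
    """Weighted multi-label log loss, averaged over samples.

    y_true:  list of rows, each a list of 0/1 labels
    y_prob:  list of rows, each a list of predicted probabilities
    weights: one non-negative weight per label (assumed values, not from the paper)
    """
    total_weight = sum(weights)
    loss_sum = 0.0
    for labels, probs in zip(y_true, y_prob):
        sample_loss = 0.0
        for t, p, w in zip(labels, probs, weights):
            # Clip probabilities so that log(0) is never evaluated.
            p = min(max(p, eps), 1.0 - eps)
            sample_loss += -w * (t * math.log(p) + (1 - t) * math.log(1.0 - p))
        # Weighted average over the labels of this sample.
        loss_sum += sample_loss / total_weight
    # Plain average over samples.
    return loss_sum / len(y_true)


# Hypothetical two-label example: the first label counts twice as much.
example_loss = weighted_log_loss(
    y_true=[[1, 0], [0, 0]],
    y_prob=[[0.9, 0.2], [0.1, 0.3]],
    weights=[2.0, 1.0],
)
```

Under this assumed definition, a lower value is better, and a model that predicts 0.5 for every label scores ln 2 (about 0.693) regardless of the weights.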