A novel method to enhance pneumonia detection via a model-level ensembling of CNN and vision transformer

Pneumonia remains a leading cause of morbidity and mortality worldwide. Chest X-ray (CXR) imaging is a fundamental diagnostic tool, but traditional analysis relies on time-intensive expert evaluation. Recently, deep learning has shown considerable potential for automating pneumonia detection from CXRs. This paper explores the application of neural networks to improve CXR-based pneumonia diagnosis. We developed a novel model that fuses a Convolutional Neural Network (CNN) and a Vision Transformer via model-level ensembling. Our fusion architecture combines a ResNet34 variant with a Multi-Axis Vision Transformer (MaxViT) small model. Both base models are initialized with ImageNet pre-trained weights. Their output layers are removed, and their features are combined using a flattening layer before final classification. Experiments used the Kaggle pediatric pneumonia dataset, which contains 1,341 normal and 3,875 pneumonia CXR images. We compared our model against standalone ResNet34, Vision Transformer, and Swin Transformer Tiny baselines under identical training procedures. Training employed extensive data augmentation, the Adam optimizer, and learning-rate warmup and decay. The fusion model achieved a state-of-the-art accuracy of 94.87%, surpassing all baselines. It also attained excellent sensitivity, specificity, kappa score, and positive predictive value. Confusion matrix analysis confirms that it makes fewer misclassifications. Combining ResNet34 with a Vision Transformer enables the model to jointly learn robust features from both the CNN and Transformer paradigms. This model-level ensemble technique effectively integrates their complementary strengths for enhanced pneumonia classification.
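The fusion step described above (remove the output layers, flatten and combine the features, then classify) can be illustrated with a minimal PyTorch sketch. This is not the authors' implementation: the class name `FusionClassifier`, the helper `tiny_branch`, the feature dimensions, and the single linear head are all assumptions made for illustration, and the two small stand-in branches merely take the place of the headless ResNet34 and MaxViT-small backbones so that the example runs without downloading pre-trained weights.

```python
import torch
from torch import nn


class FusionClassifier(nn.Module):
    """Hypothetical sketch: fuse features from two headless backbones."""

    def __init__(self, cnn_branch, vit_branch, cnn_dim, vit_dim, num_classes=2):
        super().__init__()
        self.cnn_branch = cnn_branch  # e.g. a ResNet34 with its output layer removed
        self.vit_branch = vit_branch  # e.g. a MaxViT-small with its output layer removed
        self.flatten = nn.Flatten(start_dim=1)
        # Assumption: one linear layer on the concatenated features.
        self.classifier = nn.Linear(cnn_dim + vit_dim, num_classes)

    def forward(self, x):
        # Run both branches on the same image batch and flatten their outputs.
        cnn_feat = self.flatten(self.cnn_branch(x))
        vit_feat = self.flatten(self.vit_branch(x))
        # Model-level fusion: concatenate along the feature dimension.
        fused = torch.cat([cnn_feat, vit_feat], dim=1)
        return self.classifier(fused)


def tiny_branch(out_channels):
    """Stand-in feature extractor (NOT a real ResNet34 or MaxViT)."""
    return nn.Sequential(
        nn.Conv2d(3, out_channels, kernel_size=3, stride=2, padding=1),
        nn.ReLU(),
        nn.AdaptiveAvgPool2d(1),  # output shape: (batch, out_channels, 1, 1)
    )


model = FusionClassifier(tiny_branch(16), tiny_branch(24), cnn_dim=16, vit_dim=24)
logits = model(torch.randn(4, 3, 64, 64))  # batch of 4 dummy RGB images
print(logits.shape)  # torch.Size([4, 2])
```

In a fuller version, the stand-in branches would presumably be replaced by ImageNet pre-trained backbones whose classification heads have been removed, with `cnn_dim` and `vit_dim` set to their respective feature sizes.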

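The evaluation metrics named in the abstract (accuracy, sensitivity, specificity, positive predictive value, and kappa score) all follow from the four cells of a binary confusion matrix. The sketch below shows the standard definitions, assuming "kappa" refers to Cohen's kappa and that pneumonia is the positive class. The function name `binary_metrics` and the example counts are hypothetical and are not the paper's results.

```python
def binary_metrics(tp, fn, fp, tn):
    """Compute standard binary classification metrics from confusion-matrix counts."""
    total = tp + fn + fp + tn
    accuracy = (tp + tn) / total
    sensitivity = tp / (tp + fn)  # recall on the positive (pneumonia) class
    specificity = tn / (tn + fp)  # recall on the negative (normal) class
    ppv = tp / (tp + fp)          # positive predictive value (precision)

    # Cohen's kappa: observed agreement corrected for chance agreement.
    p_yes = ((tp + fn) / total) * ((tp + fp) / total)
    p_no = ((fp + tn) / total) * ((fn + tn) / total)
    p_chance = p_yes + p_no
    kappa = (accuracy - p_chance) / (1 - p_chance)

    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": ppv,
        "kappa": kappa,
    }


# Illustrative counts only (made up for this example, not from the paper).
metrics = binary_metrics(tp=90, fn=10, fp=20, tn=80)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Because the dataset is imbalanced (3,875 pneumonia versus 1,341 normal images), reporting sensitivity, specificity, and kappa alongside accuracy gives a more complete picture than accuracy alone.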