Vision Transformers (ViTs) are widely adopted in medical imaging tasks, and some existing efforts have been directed towards vision-language training for chest X-rays (CXRs). However, we believe there is still room for improvement in vision-only training for CXRs using ViTs by aggregating information from multiple scales, which has proven beneficial for non-transformer networks. We therefore develop LT-ViT, a transformer that utilizes combined attention between image tokens and randomly initialized auxiliary tokens that represent labels. Our experiments demonstrate that LT-ViT (1) surpasses state-of-the-art performance using pure ViTs on two publicly available CXR datasets, (2) generalizes to other pre-training methods and is therefore agnostic to model initialization, and (3) enables model interpretability without Grad-CAM and its variants.
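The combined attention between image tokens and label tokens can be illustrated with a minimal NumPy sketch. This is not the LT-ViT implementation: the single attention head, the token counts and dimensions, the per-label linear head `head`, the sigmoid output, and the function name `joint_attention` are all assumptions made for illustration, and the multi-scale aggregation is not modeled. It shows only how label tokens concatenated with image tokens can attend jointly, and how the label-to-image attention weights could serve as a per-label map without gradient-based methods.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def joint_attention(image_tokens, label_tokens, w_q, w_k, w_v):
    """Single-head self-attention over concatenated image and label tokens.

    Hypothetical helper: returns updated image tokens, updated label tokens,
    and the attention weights from each label token to each image token.
    """
    n = image_tokens.shape[0]
    tokens = np.concatenate([image_tokens, label_tokens], axis=0)  # (N + L, D)
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    attn = softmax(q @ k.T / np.sqrt(k.shape[-1]), axis=-1)       # (N + L, N + L)
    out = attn @ v
    # Rows n: belong to label tokens; columns :n belong to image tokens.
    return out[:n], out[n:], attn[n:, :n]

rng = np.random.default_rng(0)
num_patches, num_labels, dim = 16, 3, 8   # assumed sizes (4x4 patch grid)

image_tokens = rng.normal(size=(num_patches, dim))
# Auxiliary label tokens, randomly initialized (one per label).
label_tokens = rng.normal(scale=0.02, size=(num_labels, dim))
w_q, w_k, w_v = (rng.normal(scale=dim ** -0.5, size=(dim, dim)) for _ in range(3))
# Assumed classifier: one weight vector per label, applied to its own token.
head = rng.normal(scale=dim ** -0.5, size=(num_labels, dim))

new_image, new_label, label_to_image = joint_attention(
    image_tokens, label_tokens, w_q, w_k, w_v
)

logits = (new_label * head).sum(axis=-1)          # one logit per label
probs = 1.0 / (1.0 + np.exp(-logits))             # assumed multi-label sigmoid
heatmaps = label_to_image.reshape(num_labels, 4, 4)  # per-label attention maps
```

In this sketch each row of `label_to_image` sums to less than one, because a label token also spends part of its attention on the label tokens themselves; a real model would stack several such layers and learn all weights during training.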