Pretraining vision transformers (ViT) with attention-guided masked image modeling (MIM) has been shown to increase downstream accuracy for natural image analysis. The hierarchical shifted window (Swin) transformer, often used in medical image analysis, cannot use attention-guided masking because it lacks an explicit [CLS] token, which is needed to compute the attention maps used for selective masking. We therefore enhanced Swin with semantic class attention. We developed a co-distilled Swin transformer in which a noisy, momentum-updated teacher guides selective masking for MIM. Our approach, called \textsc{s}e\textsc{m}antic \textsc{a}ttention guided co-distillation with noisy teacher \textsc{r}egularized Swin \textsc{T}rans\textsc{F}ormer (SMARTFormer), was applied to 3D computed tomography datasets with lung nodules and malignant lung cancers (LC). We also analyzed the impact of semantic attention and the noisy teacher on pretraining and downstream accuracy. SMARTFormer classified lesions (malignant versus benign) with an accuracy of 0.895 on 1000 nodules, predicted LC treatment response with an accuracy of 0.74, and achieved high accuracies even in limited-data regimes. Pretraining with semantic attention and a noisy teacher improved the ability to distinguish semantically meaningful structures, such as organs, in an unsupervised clustering task and to localize abnormal structures such as tumors. Code and models will be made available through GitHub upon paper acceptance.
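The two mechanisms named above, a noisy momentum-updated teacher and attention-guided selective masking, can be illustrated with a minimal NumPy sketch. This is not the SMARTFormer implementation: the function names, the additive Gaussian noise, the momentum value, and the choice to mask the highest-attention patches are all assumptions made for illustration, since the abstract does not specify them.

```python
import numpy as np


def momentum_update(teacher, student, momentum=0.99, noise_std=0.0, rng=None):
    """Exponential moving average of student weights into the teacher.

    Assumption: the teacher is made "noisy" by adding Gaussian noise to the
    averaged weights; the abstract does not say how the noise is injected.
    """
    rng = np.random.default_rng() if rng is None else rng
    updated = {}
    for name, t in teacher.items():
        ema = momentum * t + (1.0 - momentum) * student[name]
        updated[name] = ema + noise_std * rng.standard_normal(t.shape)
    return updated


def attention_guided_mask(attention, mask_ratio=0.5):
    """Return a boolean mask that is True for the patches to hide.

    Assumption: the patches with the highest teacher attention are masked;
    other selection rules (e.g. masking low-attention patches) are possible.
    """
    attention = np.asarray(attention, dtype=float)
    num_masked = int(round(mask_ratio * attention.size))
    order = np.argsort(attention.ravel())[::-1]  # indices, highest attention first
    mask = np.zeros(attention.size, dtype=bool)
    mask[order[:num_masked]] = True
    return mask.reshape(attention.shape)
```

In a training loop under these assumptions, the teacher's attention map for each volume would be passed to `attention_guided_mask` to choose which patches the student must reconstruct, and `momentum_update` would be called after each student optimization step.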