Self-supervised learning (SSL) with Vision Transformers (ViT) has shown immense potential in medical image analysis. However, the quadratic complexity ($\mathcal{O}(N^2)$) of standard self-attention poses a severe barrier for high-resolution biomedical tasks, effectively excluding resource-constrained research labs from utilizing state-of-the-art models. To address this computational bottleneck without sacrificing diagnostic accuracy, we propose \textbf{MIRAM}, a Multi-scale Masked Autoencoder that leverages a \textbf{hybrid-attention mechanism}. Our architecture uniquely decouples semantic learning from detail reconstruction using a dual-decoder design: a standard transformer decoder captures global semantics at low resolution, while a linear-complexity decoder (comparing Linformer, Performer, and Nyströmformer) handles the computationally expensive high-resolution reconstruction. This reduces the complexity of the upscaling stage from quadratic to linear ($\mathcal{O}(N)$), enabling high-fidelity training on consumer-grade GPUs. We validate our approach on the CBIS-DDSM mammography dataset. Remarkably, our \textbf{Nyströmformer-based variant} achieves a classification accuracy of \textbf{61.0\%}, outperforming both standard MAE (58.9\%) and MoCo-v3 (60.2\%) while requiring significantly less memory. These results demonstrate that hybrid-attention architectures can democratize high-resolution medical AI, making powerful SSL accessible to researchers with limited hardware resources.
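The complexity claim above can be made concrete with a small sketch. The snippet below contrasts standard softmax attention, which materializes the full $N \times N$ attention matrix, with a Nyström-style approximation that routes queries and keys through $m \ll N$ landmark tokens so that no $N \times N$ matrix is ever formed. This is a minimal NumPy illustration of the general technique, not the MIRAM decoder itself; the landmark construction (segment means) follows the published Nyströmformer recipe, and all function names here are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def full_attention(Q, K, V):
    """Standard attention: materializes an N x N matrix -> O(N^2) memory."""
    d = Q.shape[-1]
    A = softmax(Q @ K.T / np.sqrt(d))  # shape (N, N)
    return A @ V

def nystrom_attention(Q, K, V, m=8):
    """Nystrom-style approximation: m landmarks stand in for all N tokens,
    so the largest matrices are N x m and m x N -> O(N * m) memory."""
    d = Q.shape[-1]
    n = Q.shape[0]
    # Landmarks as segment means of Q and K (as in the Nystromformer paper).
    segments = np.array_split(np.arange(n), m)
    Qm = np.stack([Q[i].mean(axis=0) for i in segments])  # (m, d)
    Km = np.stack([K[i].mean(axis=0) for i in segments])  # (m, d)
    F = softmax(Q @ Km.T / np.sqrt(d))    # (N, m): queries vs. landmark keys
    A = softmax(Qm @ Km.T / np.sqrt(d))   # (m, m): landmark-landmark kernel
    B = softmax(Qm @ K.T / np.sqrt(d))    # (m, N): landmark queries vs. keys
    # Approximate softmax(QK^T) V as F A^+ (B V); the N x N matrix never exists.
    return F @ np.linalg.pinv(A) @ (B @ V)

rng = np.random.default_rng(0)
N, d = 64, 16
Q, K, V = (rng.standard_normal((N, d)) for _ in range(3))
out_full = full_attention(Q, K, V)
out_nys = nystrom_attention(Q, K, V, m=8)
print(out_full.shape, out_nys.shape)  # both (64, 16)
```

For a high-resolution reconstruction stage, $N$ is the number of patch tokens, so replacing the $N \times N$ product with $N \times m$ products is what brings the memory footprint within reach of consumer-grade GPUs.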