Efficiently Training Vision Transformers on Structural MRI Scans for Alzheimer's Disease Detection

Neuroimaging of large populations is valuable for identifying factors that promote or resist brain disease, and for assisting diagnosis, subtyping, and prognosis. Data-driven models such as convolutional neural networks (CNNs) have increasingly been applied to brain images to perform diagnostic and prognostic tasks by learning robust features. Vision transformers (ViTs), a new class of deep learning architectures, have emerged in recent years as an alternative to CNNs for several computer vision applications. Here we tested variants of the ViT architecture on neuroimaging downstream tasks of differing difficulty, namely sex classification and Alzheimer's disease (AD) classification from 3D brain MRI. In our experiments, two vision transformer architecture variants achieved an AUC of 0.987 for sex and 0.892 for AD classification, respectively. We independently evaluated our models on data from two benchmark AD datasets. Fine-tuning vision transformer models pre-trained on synthetic MRI scans (generated by a latent diffusion model) and on real MRI scans improved performance by 5% and 9-10%, respectively. Our main contributions include testing the effects of different ViT training strategies, including pre-training, data augmentation, and learning rate warm-up followed by annealing, as they pertain to the neuroimaging domain. These techniques are essential for training ViT-like models for neuroimaging applications, where training data is usually limited. We also analyzed how the amount of training data affects the test-time performance of the ViT, using data-model scaling curves.
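
As an illustration of the learning rate warm-up followed by annealing mentioned above, the following is a minimal sketch rather than the schedule used in this work. The abstract does not specify the warm-up shape or the annealing rule, so linear warm-up and cosine annealing are assumed here, and the function name `lr_at_step` and all parameter values are hypothetical.

```python
import math


def lr_at_step(step, total_steps, warmup_steps, base_lr, min_lr=0.0):
    """Learning rate at a given optimizer step (0-indexed).

    Assumed schedule (not taken from the paper): linear warm-up to
    base_lr over warmup_steps, then cosine annealing down to min_lr.
    """
    if step < warmup_steps:
        # Linear warm-up: ramps from base_lr / warmup_steps up to base_lr.
        return base_lr * (step + 1) / warmup_steps
    # Fraction of the annealing phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    # Cosine annealing from base_lr (progress = 0) to min_lr (progress = 1).
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    # Hypothetical settings chosen only for illustration.
    for s in (0, 5, 9, 10, 55, 100):
        print(s, lr_at_step(s, total_steps=100, warmup_steps=10, base_lr=1e-3))
```

In a training loop, the returned value would typically be written into the optimizer's learning rate before each update step. The small initial rate is commonly used to stabilize early transformer training, which is plausibly why such schedules matter when training data is limited.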
