Visual task adaptation has been shown to be effective for adapting pre-trained Vision Transformers (ViTs) to general downstream visual tasks using specialized learnable layers or tokens. However, there is not yet a large-scale benchmark that fully explores the effect of visual task adaptation in the realistic and important medical domain, particularly across diverse medical visual modalities such as color images, X-ray, and CT. To close this gap, we present Med-VTAB, a large-scale Medical Visual Task Adaptation Benchmark consisting of 1.68 million medical images covering diverse organs, modalities, and adaptation approaches. Based on Med-VTAB, we explore the scaling law of medical prompt tuning with respect to the number of tunable parameters, and the generalizability of medical visual adaptation using non-medical and medical pre-trained weights. In addition, we study the impact of patient ID out-of-distribution shift on medical visual adaptation, which is a real and challenging scenario. Furthermore, results from Med-VTAB indicate that a single pre-trained model falls short in medical task adaptation. Therefore, we introduce GMoE-Adapter, a novel method that combines medical and general pre-training weights through a gated mixture-of-experts adapter, achieving state-of-the-art results in medical visual task adaptation.
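The abstract does not specify how GMoE-Adapter is built, so the following is only a minimal sketch of a generic gated mixture-of-experts adapter, not the authors' implementation. It assumes two frozen backbones (one general, one medical) that each produce a feature vector of the same size, a small bottleneck adapter per backbone, and a softmax gate over the concatenated features. All names (`BottleneckAdapter`, `GatedMoEAdapter`, `general_feat`, `medical_feat`) and all dimensions are hypothetical, and the weights are random rather than trained.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def linear(weights, bias, x):
    """Compute W x + b, with W given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def relu(values):
    return [max(0.0, v) for v in values]


class BottleneckAdapter:
    """Hypothetical expert: down-project, ReLU, up-project."""

    def __init__(self, dim, hidden, rng):
        self.down_w = [[rng.gauss(0, 0.02) for _ in range(dim)] for _ in range(hidden)]
        self.down_b = [0.0] * hidden
        self.up_w = [[rng.gauss(0, 0.02) for _ in range(hidden)] for _ in range(dim)]
        self.up_b = [0.0] * dim

    def __call__(self, x):
        return linear(self.up_w, self.up_b, relu(linear(self.down_w, self.down_b, x)))


class GatedMoEAdapter:
    """Hypothetical gated mixture: one expert per frozen backbone feature."""

    def __init__(self, dim, hidden, num_experts=2, seed=0):
        rng = random.Random(seed)
        self.experts = [BottleneckAdapter(dim, hidden, rng) for _ in range(num_experts)]
        # Assumption: the gate sees the concatenation of all backbone features.
        self.gate_w = [[rng.gauss(0, 0.02) for _ in range(dim * num_experts)]
                       for _ in range(num_experts)]
        self.gate_b = [0.0] * num_experts

    def gate(self, features):
        concat = [v for feat in features for v in feat]
        return softmax(linear(self.gate_w, self.gate_b, concat))

    def __call__(self, features):
        weights = self.gate(features)
        outputs = [expert(feat) for expert, feat in zip(self.experts, features)]
        dim = len(features[0])
        # Gate-weighted sum of the expert outputs, element by element.
        return [sum(w * out[i] for w, out in zip(weights, outputs)) for i in range(dim)]


# Toy stand-ins for features from a general and a medical pre-trained backbone.
general_feat = [0.1, -0.2, 0.3, 0.4]
medical_feat = [0.5, 0.0, -0.1, 0.2]

adapter = GatedMoEAdapter(dim=4, hidden=2)
gate_weights = adapter.gate([general_feat, medical_feat])
fused = adapter([general_feat, medical_feat])
```

In this sketch the gate produces one weight per backbone, and the fused vector is the weighted sum of the two adapter outputs. A trained version would learn the gate and adapter weights while keeping both backbones frozen, which is an assumption about the setup and not a detail taken from the abstract.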