Understanding brain disorders is crucial for accurate clinical diagnosis and treatment. Recent advances in Multimodal Large Language Models (MLLMs) offer a promising approach to interpreting medical images with the support of text descriptions. However, previous research has primarily focused on 2D medical images, leaving the richer spatial information of 3D images under-explored, and single-modality methods overlook the critical clinical information contained in other modalities. To address these issues, this paper proposes Brain-Adapter, a novel approach that inserts a lightweight bottleneck adapter layer into a pre-trained model to learn new knowledge while preserving the original pre-trained knowledge. The bottleneck design trains far fewer parameters while still capturing essential information, and a Contrastive Language-Image Pre-training (CLIP) strategy aligns the multimodal data within a unified representation space. Extensive experiments demonstrate that our approach effectively integrates multimodal data and significantly improves diagnostic accuracy without high computational cost, highlighting its potential to enhance real-world diagnostic workflows.
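The abstract names two core components: a bottleneck adapter trained on top of a frozen backbone, and CLIP-style contrastive alignment of image and text embeddings. Below is a minimal sketch of both in PyTorch, assuming a standard down-project/nonlinearity/up-project adapter with a residual connection and a symmetric InfoNCE loss; the class names, embedding dimensions, and temperature value are illustrative assumptions, not taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class BottleneckAdapter(nn.Module):
    """Lightweight bottleneck adapter (illustrative): down-project,
    nonlinearity, up-project, plus a residual connection. Only the
    adapter's parameters would be trained; the backbone stays frozen."""

    def __init__(self, dim: int, bottleneck_dim: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck_dim)
        self.up = nn.Linear(bottleneck_dim, dim)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual keeps the pre-trained representation intact while
        # the low-rank path learns task-specific adjustments.
        return x + self.up(self.act(self.down(x)))


def clip_contrastive_loss(img_emb: torch.Tensor,
                          txt_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric CLIP-style InfoNCE loss over a batch of paired
    image/text embeddings; matched pairs lie on the diagonal."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature  # (B, B) similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_i2t + loss_t2i)


# Toy usage: adapt frozen image features and align them with text features.
if __name__ == "__main__":
    batch, dim = 8, 512
    adapter = BottleneckAdapter(dim)            # only these weights train
    frozen_img_feats = torch.randn(batch, dim)  # stand-in for 3D-image backbone output
    txt_feats = torch.randn(batch, dim)         # stand-in for text encoder output
    loss = clip_contrastive_loss(adapter(frozen_img_feats), txt_feats)
    print(f"contrastive loss: {loss.item():.4f}")
```

The parameter savings follow from the bottleneck shape: with dim = 512 and bottleneck_dim = 64, the adapter adds roughly 2 × 512 × 64 weights, a small fraction of the frozen backbone.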