We present a transformer-based multimodal framework for generating clinically relevant captions for MRI scans. Our system combines a DeiT-Small vision transformer as the image encoder, MediCareBERT for caption embedding, and a custom LSTM-based decoder. The architecture semantically aligns image and text embeddings using a hybrid cosine-MSE loss, and performs contrastive inference via vector similarity. We benchmark our method on the MultiCaRe dataset, evaluating performance on filtered brain-only MRIs versus general MRI images, and compare against state-of-the-art medical image captioning methods including BLIP, R2GenGPT, and recent transformer-based approaches. Results show that focusing on domain-specific data improves caption accuracy and semantic alignment. Our work proposes a scalable, interpretable solution for automated medical image reporting.
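To make the training objective and inference scheme concrete, below is a minimal PyTorch sketch of a hybrid cosine-MSE loss and of caption retrieval by vector similarity. The function names, the weighting parameter `alpha`, and its value are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def hybrid_cosine_mse_loss(img_emb: torch.Tensor,
                           txt_emb: torch.Tensor,
                           alpha: float = 0.5) -> torch.Tensor:
    # Hypothetical hybrid loss: the cosine term encourages directional
    # alignment of image and caption embeddings, while the MSE term
    # additionally matches their magnitudes. `alpha` (assumed here)
    # weights the two terms.
    cos_term = 1.0 - F.cosine_similarity(img_emb, txt_emb, dim=-1).mean()
    mse_term = F.mse_loss(img_emb, txt_emb)
    return alpha * cos_term + (1.0 - alpha) * mse_term

def retrieve_caption(img_emb: torch.Tensor,
                     candidate_caption_embs: torch.Tensor) -> int:
    # Contrastive inference via vector similarity: score each candidate
    # caption embedding (shape [N, D]) against the image embedding
    # (shape [D]) and return the index of the best match.
    sims = F.cosine_similarity(img_emb.unsqueeze(0),
                               candidate_caption_embs, dim=-1)
    return int(sims.argmax())
```

In this reading, training pulls each image embedding toward its paired caption embedding under the hybrid loss, and inference selects or re-ranks captions by cosine similarity in the shared embedding space.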