Medical data collected for a diagnostic decision are typically multi-modal and provide complementary perspectives on a subject. A computer-aided diagnosis system can benefit from multi-modal inputs; however, fusing such data effectively is challenging and has attracted considerable attention in medical research. In this paper, we propose Alifuse, a transformer-based framework for aligning and fusing multi-modal medical data. Specifically, we convert images and both unstructured and structured texts into vision and language tokens, and use intramodal and intermodal attention mechanisms to learn holistic representations of all imaging and non-imaging data for classification. We apply Alifuse to Alzheimer's disease classification and obtain state-of-the-art performance on five public datasets, outperforming eight baselines. The source code will be made available online later.
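The token-level fusion described above can be illustrated with a minimal sketch. This is not the Alifuse implementation: it assumes the images and texts have already been converted into fixed-size token vectors (random vectors stand in for them here), it omits learned projections, multiple heads, and training, and the names `attend`, `mean_pool`, and `fuse_and_classify`, the mean pooling step, and the three-class output are all hypothetical choices made for illustration.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys_values):
    """Scaled dot-product attention with no learned projections (a simplification).

    Each query token becomes a weighted average of the key/value tokens.
    """
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys_values]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, keys_values)) for j in range(d)])
    return out


def mean_pool(tokens):
    """Average a list of token vectors into one vector (hypothetical pooling choice)."""
    n = len(tokens)
    return [sum(t[j] for t in tokens) / n for j in range(len(tokens[0]))]


def fuse_and_classify(vision_tokens, language_tokens, class_weights):
    # Intramodal attention: each modality attends to its own tokens.
    v = attend(vision_tokens, vision_tokens)
    t = attend(language_tokens, language_tokens)
    # Intermodal attention: each modality attends to the other modality's tokens.
    v_from_t = attend(v, t)
    t_from_v = attend(t, v)
    # Pool each stream and concatenate into one joint representation.
    fused = mean_pool(v_from_t) + mean_pool(t_from_v)
    # Linear classification head followed by softmax.
    logits = [sum(w * x for w, x in zip(row, fused)) for row in class_weights]
    return softmax(logits)


random.seed(0)
dim = 8  # hypothetical token dimension
vision = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(6)]    # stand-in image tokens
language = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(4)]  # stand-in text tokens
head = [[random.gauss(0, 1) for _ in range(2 * dim)] for _ in range(3)]  # 3 hypothetical classes
probs = fuse_and_classify(vision, language, head)
print(probs)
```

In a real system the attention layers would carry learned query, key, and value projections and would be trained end to end; the sketch only shows how intramodal and intermodal attention compose into a single representation that feeds a classifier.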