Language modeling has shown impressive progress in generating compelling text with good accuracy and high semantic coherence. An interesting research direction is to augment these powerful models for specific applications using contextual information. In this work, we explore multi-modal language modeling for healthcare applications. We are interested in outcome prediction and patient triage in hospital emergency departments, based on the text of chief complaints and the vital signs recorded at triage. We adapt Perceiver, a modality-agnostic transformer-based model that has shown promising results in several applications. Since the vital-sign modality is represented in tabular format, we modify the Perceiver position encoding to ensure permutation invariance. We evaluate the multi-modal language model on the task of diagnosis code prediction using 120K visits from the MIMIC-IV ED dataset. In the experimental analysis, we show that multi-modality improves prediction performance compared with models trained solely on text or solely on vital signs. We identify the disease categories for which multi-modality leads to performance improvement and show that, for these categories, vital signs have added predictive power. By analyzing the cross-attention layer, we show how multi-modality contributes to model predictions. This work offers insights into the development of multi-modal language models for healthcare applications.
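The abstract does not specify how the position encoding is modified, so the following is only a minimal sketch of one way permutation invariance over tabular inputs can be obtained, not the authors' implementation. It assumes each vital sign is tagged with a vector tied to its name rather than to its column index, and that a small set of latent vectors then cross-attends to the resulting tokens. The vital-sign names, the sizes `D` and `N_LATENTS`, the random stand-ins for learned weights, and the measurement values are all hypothetical.

```python
import math
import random

random.seed(0)

# Hypothetical vital-sign names and sizes, chosen only for illustration.
VITALS = ["temperature", "heartrate", "resprate", "o2sat", "sbp", "dbp"]
D = 8          # token width
N_LATENTS = 4  # number of latent query vectors


def rand_vec(n):
    """Random vector standing in for a learned parameter."""
    return [random.gauss(0.0, 1.0) for _ in range(n)]


# One vector per vital-sign *name* replaces an index-based position code,
# so a token's content does not depend on where its column sits.
FEATURE_EMB = {name: rand_vec(D) for name in VITALS}
VALUE_DIR = rand_vec(D)  # direction scaled by the measured value
LATENTS = [rand_vec(D) for _ in range(N_LATENTS)]


def tokenize(vitals):
    """Return one token per vital sign, in the order the dict yields them."""
    return [
        [e + value * d for e, d in zip(FEATURE_EMB[name], VALUE_DIR)]
        for name, value in vitals.items()
    ]


def cross_attend(tokens):
    """Single-head cross-attention from the latents to the tokens.

    Query/key/value projections are omitted (identity) to keep the sketch short.
    """
    outputs = []
    for q in LATENTS:
        scores = [sum(a * b for a, b in zip(q, t)) / math.sqrt(D) for t in tokens]
        peak = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        outputs.append(
            [sum(w * t[i] for w, t in zip(weights, tokens)) / total for i in range(D)]
        )
    return outputs


# Made-up standardized measurements for one visit.
record = {"temperature": 0.3, "heartrate": 1.2, "resprate": -0.4,
          "o2sat": -1.1, "sbp": 0.8, "dbp": 0.1}
shuffled = dict(reversed(list(record.items())))  # same values, columns reordered

out_a = cross_attend(tokenize(record))
out_b = cross_attend(tokenize(shuffled))

# Largest element-wise gap between the two outputs; it should be at the level
# of floating-point rounding, since only the summation order differs.
max_diff = max(abs(a - b) for ra, rb in zip(out_a, out_b) for a, b in zip(ra, rb))
print(f"max difference after shuffling columns: {max_diff:.2e}")
```

The token list itself changes when the columns are reordered, but the attention output does not, because the softmax-weighted sum treats the tokens as a set. An index-based position code added to each token would break this property, which is presumably why a tabular modality calls for a different encoding.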