Respiratory diseases remain a leading cause of global mortality, and timely, accurate diagnosis is critical to improving patient outcomes and reducing healthcare burdens. While prior work has explored audio-based models for respiratory disease detection, such unimodal approaches often suffer from limited generalizability and diagnostic precision. In this paper, we propose RespiraMFM, a Multimodal Foundation Model that integrates respiratory sounds with patient medical history and symptoms to enhance diagnostic accuracy and disease detection capabilities. We introduce an effective contrastive alignment strategy for audio-text multimodal integration, allowing the model to learn better cross-modal representations between respiratory sounds and the corresponding textual clinical information. We evaluate RespiraMFM on five major respiratory diseases using seven real-world datasets in both supervised fine-tuning and zero-shot settings. Relative to existing baselines, it achieves a 9.15% improvement in AUROC on supervised tasks and a 20.98% gain on zero-shot tasks. These findings underscore the potential of our framework to advance early diagnosis and improve clinical decision-making in respiratory disease management.
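To make the idea of contrastive audio-text alignment concrete, the following is a minimal sketch of one common formulation: a symmetric InfoNCE loss of the kind used in CLIP-style training. The abstract does not specify the exact objective used by RespiraMFM, so the loss form, the function names, and the `temperature` value here are all illustrative assumptions rather than the authors' method. The sketch assumes row `i` of the audio embeddings and row `i` of the text embeddings come from the same patient record, and that no embedding is the zero vector.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (assumes v is not all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def contrastive_alignment_loss(audio_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    audio_emb[i] and text_emb[i] are assumed to describe the same record;
    all other pairings in the batch act as negatives. The temperature
    default is a hypothetical choice, not a value taken from the paper.
    """
    a = [l2_normalize(v) for v in audio_emb]
    t = [l2_normalize(v) for v in text_emb]
    # Cosine similarity between every audio/text pair, scaled by temperature.
    logits = [
        [sum(x * y for x, y in zip(ai, tj)) / temperature for tj in t]
        for ai in a
    ]
    # Transpose to score each text against all audio clips.
    logits_t = [list(col) for col in zip(*logits)]
    # Average the audio-to-text and text-to-audio directions.
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits_t))


if __name__ == "__main__":
    audio = [[1.0, 0.0], [0.0, 1.0]]
    matched_text = [[2.0, 0.0], [0.0, 3.0]]
    mismatched_text = [[0.0, 3.0], [2.0, 0.0]]
    print(contrastive_alignment_loss(audio, matched_text))
    print(contrastive_alignment_loss(audio, mismatched_text))
```

Under this formulation the loss is small when each respiratory-sound embedding is most similar to its own clinical-text embedding and large when the pairs are mismatched, which is the behavior a cross-modal alignment objective is meant to encourage.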