Deep learning models can be applied successfully to real-world problems; however, training most of these models requires massive amounts of data. Recent methods combine language and vision, but they rely on datasets that are usually not publicly available. Here we pave the way for further research in the multimodal language-vision domain for radiology. In this paper, we train a representation learning method that combines local and global representations of language and vision through an attention mechanism, using the publicly available Indiana University Radiology Report (IU-RR) dataset. Furthermore, we use the learned representations to diagnose five lung pathologies: atelectasis, cardiomegaly, edema, pleural effusion, and consolidation. Finally, we use both supervised and zero-shot classification to extensively analyze the performance of the representation learning on the IU-RR dataset. We evaluate the classifiers by the Area Under the Curve (AUC) averaged over the five lung pathologies. The average AUC on the IU-RR test set ranged from 0.85 to 0.87 across the different training datasets, namely CheXpert and CheXphoto. These results compare favorably to other studies using IU-RR. Extensive experiments confirm consistent results for classifying lung pathologies using the multimodal global-local representations of language and vision information.
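To make the evaluation metric concrete, the following is a minimal sketch of how an average AUC over the five pathologies could be computed. It is an illustration only, not the evaluation code used in this work: the function names `binary_auc` and `average_auc` and the dictionary layout are assumptions, and the per-pathology AUC is computed with the standard pairwise-ranking definition (the probability that a randomly chosen positive case scores higher than a randomly chosen negative case, with ties counted as one half).

```python
from typing import Dict, List, Sequence, Tuple

# The five pathologies named in the abstract.
PATHOLOGIES: List[str] = [
    "atelectasis",
    "cardiomegaly",
    "edema",
    "pleural effusion",
    "consolidation",
]


def binary_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC for one pathology via pairwise comparison of positives and negatives.

    Hypothetical helper: labels are 0/1, scores are classifier outputs.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    # A positive scored above a negative counts 1, a tie counts 0.5.
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def average_auc(
    labels_by_class: Dict[str, Sequence[int]],
    scores_by_class: Dict[str, Sequence[float]],
) -> Tuple[float, Dict[str, float]]:
    """Unweighted mean of the per-pathology AUCs, plus the individual values."""
    per_class = {
        name: binary_auc(labels_by_class[name], scores_by_class[name])
        for name in labels_by_class
    }
    return sum(per_class.values()) / len(per_class), per_class
```

Each pathology is treated here as an independent binary problem and the mean is unweighted, so every pathology contributes equally regardless of how often it occurs. This is one common reading of "average AUC"; a prevalence-weighted mean would be an alternative.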