Multimodal deep learning that combines medical images with diagnostic reports has made substantial progress in medical imaging diagnostics, and is particularly effective for assisted diagnosis when annotations are scarce. Nonetheless, accurately localizing diseases without detailed positional annotations remains a challenge. Although existing methods use local information to achieve fine-grained semantic alignment, their ability to extract fine-grained semantics from the full context of a report is limited. To address this problem, we introduce a method that takes full sentences from textual reports as the basic units for local semantic alignment. Our approach pairs chest X-ray images with their corresponding textual reports and performs contrastive learning at both the global and local levels. The leading results our method obtains on multiple datasets confirm its efficacy for lesion localization.
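To make the two alignment levels concrete, the following is a minimal NumPy sketch of one way global and sentence-level contrastive objectives could be combined. It is not the authors' implementation: the function names, the symmetric InfoNCE formulation, the attention-based sentence-to-patch pooling, the temperature value, and the assumption that every report is padded to the same number of sentences are all illustrative choices that the abstract does not specify.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    # Scale vectors to unit length so dot products become cosine similarities.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def log_softmax(x, axis=-1):
    # Numerically stable log-softmax.
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def info_nce(sim, temperature=0.1):
    # Symmetric InfoNCE over a (B, B) similarity matrix whose rows are images
    # and columns are reports; matched pairs sit on the diagonal.
    logits = sim / temperature
    idx = np.arange(sim.shape[0])
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_i2t + loss_t2i)

def global_loss(img_global, txt_global, temperature=0.1):
    # Global level: one embedding per image (B, D) and one per report (B, D).
    sim = l2_normalize(img_global) @ l2_normalize(txt_global).T
    return info_nce(sim, temperature)

def local_loss(patches, sentences, temperature=0.1):
    # Local level (hypothetical design): each sentence is the alignment unit.
    # patches: (B, P, D) image patch features; sentences: (B, S, D).
    p = l2_normalize(patches)
    s = l2_normalize(sentences)
    # Attention of every sentence in report j over the patches of image i.
    attn_logits = np.einsum("ipd,jsd->ijsp", p, s) / temperature
    attn = np.exp(log_softmax(attn_logits, axis=-1))        # (B, B, S, P)
    # Sentence-conditioned image feature, then cosine similarity to the sentence.
    attended = l2_normalize(np.einsum("ijsp,ipd->ijsd", attn, p))
    sent_sim = np.einsum("ijsd,jsd->ijs", attended, s)      # (B, B, S)
    sim = sent_sim.mean(axis=-1)                            # (B, B)
    idx = np.arange(p.shape[0])
    # attn[idx, idx] holds, for each matched pair, one patch map per sentence.
    return info_nce(sim, temperature), attn[idx, idx]

# Usage with random stand-in features (no real encoder is involved here).
rng = np.random.default_rng(0)
g_loss = global_loss(rng.normal(size=(3, 8)), rng.normal(size=(3, 8)))
l_loss, maps = local_loss(rng.normal(size=(3, 5, 8)), rng.normal(size=(3, 2, 8)))
total = g_loss + l_loss
```

In this sketch, the per-sentence attention maps returned by `local_loss` have shape `(B, S, P)` and could be reshaped to the patch grid as a coarse localization heat map; whether the actual method derives its localization this way is an assumption.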