Medical vision-language pre-training (VLP) has emerged as a frontier of research, enabling zero-shot pathological recognition by comparing a query image with the textual description of each disease. Due to the complex semantics of biomedical texts, current methods struggle to align medical images with key pathological findings in unstructured reports. This leads to misalignment with the target disease's textual representation. In this paper, we introduce a novel VLP framework that dissects disease descriptions into their fundamental aspects, leveraging prior knowledge about the visual manifestations of pathologies. This is achieved by consulting a large language model and medical experts. Using a Transformer module, our approach aligns an input image with the diverse aspects of a disease, generating aspect-centric image representations. By consolidating the matches from each aspect, we improve the compatibility between an image and its associated disease. Additionally, capitalizing on the aspect-oriented representations, we present a dual-head Transformer tailored to process known and unknown diseases, improving overall detection performance. In experiments on seven downstream datasets, our method outperforms recent methods by up to 8.07% and 11.23% in AUC for seen and novel categories, respectively. Our code is released at \href{https://github.com/HieuPhan33/MAVL}{https://github.com/HieuPhan33/MAVL}.
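The idea of consolidating per-aspect matches into one disease score can be illustrated with a minimal sketch. This is not the released implementation: the aspect names (`texture`, `location`), the toy two-dimensional embeddings, the use of cosine similarity, and the choice of a plain mean as the consolidation step are all assumptions made for illustration. The abstract only states that matches from each aspect are consolidated.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def disease_score(aspect_image_embs, aspect_text_embs):
    """Consolidate per-aspect matches into one score.

    Both arguments map a (hypothetical) aspect name to an embedding.
    The mean over aspects is an assumed consolidation rule.
    """
    sims = [
        cosine(aspect_image_embs[name], aspect_text_embs[name])
        for name in aspect_text_embs
    ]
    return sum(sims) / len(sims)


def rank_diseases(image_embs_by_disease, text_embs_by_disease):
    """Rank diseases by consolidated aspect score, best match first."""
    scores = {
        disease: disease_score(image_embs_by_disease[disease], aspects)
        for disease, aspects in text_embs_by_disease.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy data: hypothetical aspect-level text embeddings for two diseases.
text_embs = {
    "pneumonia": {"texture": [1.0, 0.0], "location": [0.0, 1.0]},
    "effusion": {"texture": [0.0, 1.0], "location": [1.0, 0.0]},
}
# Hypothetical aspect-centric image representations, one per disease aspect.
image_embs = {
    "pneumonia": {"texture": [0.9, 0.1], "location": [0.1, 0.9]},
    "effusion": {"texture": [0.9, 0.1], "location": [0.1, 0.9]},
}

ranking = rank_diseases(image_embs, text_embs)
```

In this toy setup the image representations agree with both aspects of the first disease and with neither aspect of the second, so the first disease ranks higher. A single global text embedding could not separate a partial match on one aspect from a full match on all of them.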