This research explores the integration of language models and unsupervised anomaly detection in medical imaging, addressing two key questions: (1) Can language models enhance the interpretability of anomaly detection maps? and (2) Can anomaly maps improve the generalizability of language models in open-set anomaly detection tasks? To investigate these questions, we introduce a new dataset for multi-image visual question answering on brain magnetic resonance images encompassing multiple conditions. We propose KQ-Former (Knowledge Querying Transformer), which is designed to optimally align visual and textual information in limited-sample contexts. Our model achieves 60.81% accuracy on closed questions, covering disease classification and severity across 15 different classes. For open questions, KQ-Former achieves a BLEU-4 score of 0.41, a 70% improvement over the baseline, and attains the highest entailment ratios (up to 71.9%) and lowest contradiction ratios (down to 10.0%) across various natural language inference models. Furthermore, integrating anomaly maps yields an 18% accuracy increase in detecting open-set anomalies, thereby enhancing the language model's generalizability to previously unseen medical conditions. The code and dataset are available at https://github.com/compai-lab/miccai-2024-junli?tab=readme-ov-file