The complex and dynamic real-world clinical environment demands reliable deep learning (DL) systems. Out-of-distribution (OOD) detection plays a critical role in enhancing the reliability and generalizability of DL models when they encounter data that deviate from the training distribution, such as unseen disease cases. However, existing OOD detection methods typically rely either on a single visual modality or solely on image-text matching, failing to fully leverage multimodal information. To overcome this challenge, we propose a novel dual-branch multimodal framework consisting of a text-image branch and a vision branch. Our framework fully exploits multimodal representations through these two complementary branches to identify OOD samples. After training, we compute a score from the text-image branch ($S_t$) and a score from the vision branch ($S_v$), and integrate them to obtain the final OOD score $S$, which is compared against a threshold for OOD detection. Comprehensive experiments on publicly available endoscopic image datasets demonstrate that our proposed framework is robust across diverse backbones and improves state-of-the-art performance in OOD detection by up to 24.84%.
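The abstract does not specify how $S_t$ and $S_v$ are integrated, so the following is only a minimal sketch of the score-fusion-and-threshold step it describes, assuming a convex combination with an assumed weight `lam` and assuming that higher scores indicate in-distribution samples; the function names and parameters are hypothetical, not from the paper.

```python
import numpy as np

def fuse_ood_scores(s_t: np.ndarray, s_v: np.ndarray, lam: float = 0.5) -> np.ndarray:
    """Integrate text-image branch scores (s_t) and vision branch scores (s_v)
    into a single OOD score S. A convex combination is one plausible fusion
    rule; the actual rule and the weight `lam` are assumptions here."""
    return lam * s_t + (1.0 - lam) * s_v

def is_ood(s: np.ndarray, threshold: float) -> np.ndarray:
    """Flag samples whose fused score S falls below the threshold as OOD.
    This assumes the convention that higher S means more in-distribution;
    the opposite sign convention would flip the comparison."""
    return s < threshold

# Example: two samples, one clearly in-distribution, one not.
s_t = np.array([0.9, 0.2])   # text-image branch scores
s_v = np.array([0.8, 0.3])   # vision branch scores
s = fuse_ood_scores(s_t, s_v, lam=0.6)
print(is_ood(s, threshold=0.5))  # [False  True]
```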