Vision-language foundation models have emerged as powerful general-purpose representation learners with strong potential for multimodal understanding, but their deterministic embeddings often fail to provide the reliability required for high-stakes biomedical applications. This work introduces MedProbCLIP, a probabilistic vision-language learning framework for chest X-ray and radiology report representation learning and bidirectional retrieval. MedProbCLIP models image and text representations as Gaussian embeddings through a probabilistic contrastive objective that explicitly captures uncertainty and many-to-many correspondences between radiographs and clinical narratives. A variational information bottleneck mitigates overconfident predictions. During training, MedProbCLIP employs multi-view radiograph encoding and multi-section report encoding to provide fine-grained supervision for clinically aligned correspondence, yet it requires only a single radiograph and a single report at inference. Evaluated on the MIMIC-CXR dataset, MedProbCLIP outperforms deterministic and probabilistic baselines, including CLIP, CXR-CLIP, and PCME++, in both retrieval and zero-shot classification. Beyond accuracy, MedProbCLIP demonstrates superior calibration, risk-coverage behavior, selective retrieval reliability, and robustness to clinically relevant corruptions, underscoring the value of probabilistic vision-language modeling for improving the trustworthiness and safety of radiology image-text retrieval systems.
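The abstract does not give the exact form of the probabilistic contrastive objective. As a rough illustration only, the following sketch assumes PCME-style diagonal Gaussian embeddings, where each encoder outputs a mean and a log-variance and a pair's match probability is a sigmoid of the (negative, scaled) expected squared distance between the two Gaussians; the function name, the 4-d embeddings, and the scale parameters `a` and `b` are all hypothetical, not taken from the paper.

```python
import numpy as np

def match_probability(mu_img, logvar_img, mu_txt, logvar_txt, a=1.0, b=5.0):
    """Match probability between two diagonal Gaussian embeddings
    N(mu, diag(exp(logvar))), in the spirit of PCME/PCME++ (illustrative
    sketch, not the paper's exact objective).

    E||z_i - z_t||^2 = ||mu_i - mu_t||^2 + tr(Sigma_i) + tr(Sigma_t)
    for independent diagonal Gaussians; a sigmoid maps the expected
    distance to a probability, so higher variance lowers confidence.
    """
    expected_sq_dist = (
        np.sum((mu_img - mu_txt) ** 2)
        + np.sum(np.exp(logvar_img))   # tr(Sigma_i)
        + np.sum(np.exp(logvar_txt))   # tr(Sigma_t)
    )
    return 1.0 / (1.0 + np.exp(a * expected_sq_dist - b))

# Hypothetical 4-d embeddings for one radiograph / report pair.
mu_i = np.array([0.20, -0.10, 0.40, 0.00])
mu_t = np.array([0.25, -0.05, 0.35, 0.10])
lv_i = np.full(4, -3.0)   # low variance: confident image embedding
lv_t = np.full(4, -1.0)   # higher variance: more uncertain report
p = match_probability(mu_i, lv_i, mu_t, lv_t)
```

Because the expected distance grows with the embedding variances, an uncertain radiograph or an ambiguous report automatically receives a lower match probability, which is what enables the selective-retrieval and risk-coverage behavior the abstract evaluates.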