We introduce eCLIP, an enhanced version of the CLIP model that integrates expert annotations in the form of radiologist eye-gaze heatmaps. It tackles two key challenges in contrastive multi-modal medical image analysis: data scarcity and the "modality gap", a significant disparity between image and text embeddings that degrades representation quality and hampers cross-modal interoperability. eCLIP adds a heatmap processor and uses mixup augmentation to make efficient use of the scarce expert annotations, improving the model's learning effectiveness. It is designed to apply to any CLIP variant without modifying the core architecture. In detailed evaluations across several tasks, including zero-shot inference, linear probing, cross-modal retrieval, and retrieval-augmented generation (RAG) of radiology reports with a frozen large language model, eCLIP shows consistent improvements in embedding quality. The results show improved alignment and uniformity, indicating that eCLIP can harness high-quality annotations to enrich multi-modal analysis in the medical imaging domain.
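The abstract reports embedding quality in terms of alignment and uniformity without defining them. As a minimal sketch, assuming the commonly used hypersphere formulation (alignment as the mean squared distance between matched, unit-normalized image and text embeddings; uniformity as the log of the mean Gaussian potential over all distinct pairs), the two quantities could be computed as follows. The function names and the temperature `t=2.0` are illustrative assumptions, not taken from eCLIP.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so it lies on the hypersphere."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def alignment(image_embs, text_embs):
    """Mean squared distance between matched image/text pairs (lower is better).

    Assumes image_embs[i] and text_embs[i] describe the same study.
    """
    dists = [
        sq_dist(l2_normalize(a), l2_normalize(b))
        for a, b in zip(image_embs, text_embs)
    ]
    return sum(dists) / len(dists)


def uniformity(embs, t=2.0):
    """Log mean Gaussian potential over distinct pairs (lower is more uniform).

    t is an assumed temperature; needs at least two embeddings.
    """
    z = [l2_normalize(e) for e in embs]
    potentials = [
        math.exp(-t * sq_dist(z[i], z[j]))
        for i in range(len(z))
        for j in range(i + 1, len(z))
    ]
    return math.log(sum(potentials) / len(potentials))
```

Under this formulation, a smaller alignment value means paired image and text embeddings sit closer together, which is one way to quantify a narrower modality gap. A more negative uniformity value means the embeddings spread more evenly over the hypersphere instead of collapsing into a small region.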