In multi-modal frameworks, aligning cross-modal features presents a significant challenge. The predominant approaches to multi-modal pre-training pursue either global or local alignment between modalities over extensive datasets. These bottom-up, data-driven methods often lack interpretability, a critical concern in radiology. Previous studies have incorporated high-level labels for medical images or text, but such labels still depend on manual annotation, a costly and labor-intensive process. Our work introduces a novel approach that uses eye-gaze data collected from radiologists synchronously during diagnostic evaluations. This data, which indicates where radiologists focus, naturally links chest X-rays to diagnostic text. We propose the Eye-gaze Guided Multi-modal Alignment (EGMA) framework to harness eye-gaze data for better alignment of image and text features, aiming to reduce reliance on manual annotations and thereby cut training costs. Our model demonstrates robust performance, outperforming other state-of-the-art methods on zero-shot classification and retrieval tasks. Incorporating eye-gaze data, which is easily obtained during routine radiological diagnosis, marks a step toward minimizing dependence on manual annotation. Additionally, we examine how varying amounts of eye-gaze data affect model performance, highlighting the feasibility and utility of integrating this auxiliary data into multi-modal pre-training.
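To make the core idea concrete, the sketch below illustrates one way gaze fixations could supervise fine-grained image-text alignment: the radiologist's fixation distribution over image patches, recorded while each report sentence is dictated, is used as a soft target for the model's sentence-to-patch attention. This is a minimal, hedged illustration under assumed inputs and shapes (the function name, loss form, and tensors are hypothetical), not the paper's actual EGMA objective.

```python
# Illustrative sketch of gaze-weighted patch-sentence alignment (assumed formulation,
# not the authors' EGMA implementation).
import torch
import torch.nn.functional as F

def gaze_guided_alignment_loss(patch_feats, sent_feats, gaze, temperature=0.07):
    """
    patch_feats: (P, D) image patch embeddings from the vision encoder
    sent_feats:  (S, D) report sentence embeddings from the text encoder
    gaze:        (S, P) fixation weights; row s sums to 1 over patches
    """
    patch_feats = F.normalize(patch_feats, dim=-1)
    sent_feats = F.normalize(sent_feats, dim=-1)
    # Cross-modal similarity between every sentence and every patch.
    sim = sent_feats @ patch_feats.T / temperature        # (S, P)
    # Model's soft attention of each sentence over image patches.
    attn = sim.softmax(dim=-1)                            # (S, P)
    # Encourage the model's attention to match the radiologist's fixation
    # distribution (cross-entropy between gaze weights and predicted attention).
    return -(gaze * attn.clamp_min(1e-8).log()).sum(dim=-1).mean()

# Toy usage with random tensors (shapes only; real inputs come from the encoders).
P, S, D = 196, 6, 512
loss = gaze_guided_alignment_loss(
    torch.randn(P, D), torch.randn(S, D), torch.rand(S, P).softmax(dim=-1)
)
```

In such a formulation the gaze signal acts as free, behaviorally grounded supervision: no extra manual labels are needed, which is consistent with the abstract's aim of reducing annotation cost.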