Medical image retrieval aims to identify clinically relevant lesion cases to support diagnostic decision-making, education, and quality control. In practice, retrieval queries often combine a reference lesion image with textual descriptors such as dermoscopic features. We study composed vision-language retrieval for skin cancer, where each query consists of an image-text pair and the database contains biopsy-confirmed, multi-class disease cases. We propose a transformer-based framework that learns hierarchical composed query representations and performs joint global-local alignment between queries and candidate images. Local alignment aggregates discriminative regions via multiple spatial attention masks, while global alignment provides holistic semantic supervision. The final similarity is a convex, domain-informed weighting of the local and global scores that emphasizes clinically salient local evidence while preserving global consistency. Experiments on the public Derm7pt dataset demonstrate consistent improvements over state-of-the-art methods. The proposed framework enables efficient access to relevant medical records and supports practical clinical deployment.
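The convex weighting of local and global similarity described in the abstract can be illustrated with a minimal sketch. The abstract gives no formulas, so everything here is an assumption made for illustration: the function names, the use of cosine similarity, the mask-weighted average pooling, the pairing of query and candidate masks, and the weight `alpha = 0.7`. It is not the authors' implementation.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def masked_pool(regions: Sequence[Vector], mask: Sequence[float]) -> List[float]:
    """Average region features weighted by one spatial attention mask (assumed pooling)."""
    total = sum(mask)
    if total <= 0:
        raise ValueError("attention mask must have positive total weight")
    dim = len(regions[0])
    return [sum(m * r[d] for m, r in zip(mask, regions)) / total for d in range(dim)]


def local_similarity(
    query_regions: Sequence[Vector],
    cand_regions: Sequence[Vector],
    query_masks: Sequence[Sequence[float]],
    cand_masks: Sequence[Sequence[float]],
) -> float:
    """Mean cosine similarity over paired masks (assumed aggregation rule)."""
    sims = [
        cosine(masked_pool(query_regions, qm), masked_pool(cand_regions, cm))
        for qm, cm in zip(query_masks, cand_masks)
    ]
    return sum(sims) / len(sims)


def combined_similarity(s_global: float, s_local: float, alpha: float = 0.7) -> float:
    """Convex combination; alpha > 0.5 favors local evidence (alpha is hypothetical)."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1] for a convex combination")
    return alpha * s_local + (1.0 - alpha) * s_global


# Usage with toy 2-D features: two regions per image, two masks per image.
query_regions = [[1.0, 0.0], [0.0, 1.0]]
cand_regions = [[0.9, 0.1], [0.2, 0.8]]
masks = [[1.0, 0.0], [0.0, 1.0]]
s_local = local_similarity(query_regions, cand_regions, masks, masks)
s_global = cosine([1.0, 1.0], [1.1, 0.9])
score = combined_similarity(s_global, s_local)
```

Because the two weights are non-negative and sum to one, the combined score always lies between the local and global scores, so neither term can dominate beyond its assigned share.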