Medical diagnosis requires the effective synthesis of visual manifestations and clinical metadata. However, existing methods often treat metadata as isolated tags, failing to exploit the rich semantic knowledge embedded in clinical descriptions. We propose PRIMA (Pre-training with Risk-integrated Image-Metadata Alignment), a framework that integrates domain-specific knowledge into multi-modal representation learning. We first curate an expert corpus of risk-disease correlations via Retrieval-Augmented Generation (RAG) and use it to refine Clinical ModernBERT, embedding diagnostic priors into the text encoder. To bridge the modality gap, we introduce a dual-encoder pre-training strategy built on DINOv3 and our refined BERT, optimized with four complementary loss functions that capture multi-granular semantic alignment and handle the ambiguity of clinical correlations through soft labels. Finally, we leverage Qwen-3 to fuse the aligned features for precise disease classification. Extensive experiments demonstrate that PRIMA effectively harmonizes pixel-level features with abstract clinical expertise, significantly outperforming state-of-the-art methods. Notably, our framework achieves this robustness without massive data collection or extensive computational resources. Our code will be made public upon acceptance.
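The abstract mentions handling ambiguous clinical correlations with soft labels during image-text alignment. As a minimal sketch of what such a soft-label alignment objective can look like (the function names, NumPy implementation, and temperature value are illustrative assumptions, not the paper's actual loss definitions): instead of matching each image to exactly one text with a one-hot target, the cross-entropy is computed against a soft target distribution over the batch.

```python
import numpy as np

def log_softmax(x):
    # Numerically stable row-wise log-softmax.
    shifted = x - x.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

def soft_label_alignment_loss(img_emb, txt_emb, soft_targets, temperature=0.07):
    """Cross-entropy between image->text similarities and soft targets.

    soft_targets rows sum to 1; a one-hot matrix recovers the standard
    contrastive (InfoNCE-style) objective. All details here are a sketch,
    not PRIMA's exact formulation.
    """
    # L2-normalize so the dot products below are cosine similarities.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature  # scaled similarity matrix
    # Soft cross-entropy: weight each pair's log-probability by its target.
    return float(-(soft_targets * log_softmax(logits)).sum(axis=1).mean())
```

A target row such as `[0.8, 0.1, 0.1]` instead of `[1, 0, 0]` expresses that the off-diagonal texts are partially related (e.g., overlapping risk factors), so the model is not penalized as if they were pure negatives.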