Recent advances in medical vision-language models guide the learning of visual representations; however, this form of supervision is constrained by the availability of paired image-text data, raising the question of whether robust radiology encoders can be learned without language supervision. In this work, we introduce RadJEPA, a self-supervised framework built on a Joint Embedding Predictive Architecture (JEPA). Pre-trained solely on unlabeled chest X-ray images, the model learns to predict the latent representations of masked image regions from the visible context. This predictive objective differs fundamentally from both image-text pre-training and DINO-style self-distillation: rather than aligning global representations across views or modalities, RadJEPA explicitly models prediction in latent space. We evaluate the learned encoder on disease classification, semantic segmentation, and report generation. Across benchmarks, RadJEPA exceeds the performance of state-of-the-art approaches, including Rad-DINO.
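To make the predictive objective concrete, the following is a minimal, self-contained PyTorch sketch of a JEPA-style training step: a context encoder sees only unmasked patches, an EMA target encoder produces latent targets for the masked positions, and a predictor regresses those targets in latent space. This is an illustration of the general technique under stated assumptions, not the paper's implementation: the per-patch MLP stand-ins for the ViT components, the zero-masking of context tokens, the smooth-L1 loss, and the momentum value are all assumptions.

```python
# Illustrative JEPA-style latent-prediction step (all names/shapes assumed).
import torch
import torch.nn as nn
import torch.nn.functional as F

D = 64   # latent dimension (illustrative)
N = 196  # number of image patches, e.g. a 14x14 grid (illustrative)

# Per-patch MLPs stand in for the real ViT encoders and predictor.
context_encoder = nn.Sequential(nn.Linear(D, D), nn.GELU(), nn.Linear(D, D))
target_encoder  = nn.Sequential(nn.Linear(D, D), nn.GELU(), nn.Linear(D, D))
predictor       = nn.Sequential(nn.Linear(D, D), nn.GELU(), nn.Linear(D, D))

# The target encoder is an EMA copy of the context encoder: initialized
# from its weights and never updated by backprop.
target_encoder.load_state_dict(context_encoder.state_dict())
for p in target_encoder.parameters():
    p.requires_grad_(False)

def jepa_loss(patch_tokens, mask):
    """patch_tokens: (B, N, D) patch embeddings of an unlabeled image.
    mask: (B, N) boolean, True where the patch is hidden from the context."""
    # Latent targets: EMA-encoder features of the full, unmasked image.
    with torch.no_grad():
        targets = target_encoder(patch_tokens)
    # Context: masked patches are zeroed out here for brevity; I-JEPA
    # instead drops masked tokens from the encoder's input sequence.
    visible = patch_tokens * (~mask).unsqueeze(-1)
    context = context_encoder(visible)
    # Predict latents at the masked positions and regress onto the targets.
    preds = predictor(context)
    return F.smooth_l1_loss(preds[mask], targets[mask])

@torch.no_grad()
def ema_update(momentum=0.996):
    # Target encoder slowly tracks the context encoder.
    for pt, pc in zip(target_encoder.parameters(),
                      context_encoder.parameters()):
        pt.mul_(momentum).add_(pc, alpha=1.0 - momentum)

# Usage: one training step on random tokens with a random ~50% mask.
tokens = torch.randn(2, N, D)
mask = torch.rand(2, N) < 0.5
loss = jepa_loss(tokens, mask)
loss.backward()
ema_update()
```

Because both the targets and the predictions live in representation space rather than pixel space, the gradient never rewards reconstructing low-level texture, which is the intended contrast with pixel-reconstruction and view-alignment objectives noted in the abstract.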