A multimodal vision foundation model for generalizable knee pathology

Musculoskeletal disorders are a leading cause of disability worldwide, creating an urgent demand for precise interpretation of medical imaging. Current artificial intelligence (AI) approaches in orthopedics rely predominantly on task-specific, supervised learning paradigms. These methods are inherently fragmented, require extensive annotated datasets, and often fail to generalize across modalities and clinical scenarios. The development of foundation models in this field has been constrained by the scarcity of large-scale, curated, open-source musculoskeletal datasets. To address these challenges, we introduce OrthoFoundation, a multimodal vision foundation model optimized for musculoskeletal pathology. We constructed a pre-training dataset of 1.2 million unlabeled knee X-ray and MRI images from internal and public databases. Using a DINOv3 backbone, the model was trained via self-supervised contrastive learning to capture robust radiological representations. OrthoFoundation achieves state-of-the-art (SOTA) performance across 14 downstream tasks. It attained superior accuracy in X-ray osteoarthritis diagnosis and ranked first in MRI structural injury detection. The model was also label-efficient, matching supervised baselines using only 50% of the labeled data. Furthermore, despite being pre-trained only on knee images, OrthoFoundation generalized across anatomy to the hip, shoulder, and ankle. OrthoFoundation represents a significant advance toward general-purpose AI for musculoskeletal imaging. By learning fundamental, joint-agnostic radiological semantics from large-scale multimodal data, it overcomes the limitations of conventional task-specific models and provides a robust framework for reducing annotation burdens and enhancing diagnostic accuracy in clinical practice.
