General-purpose AI models, particularly those designed for text and vision, demonstrate impressive versatility across a wide range of deep-learning tasks. However, they often underperform in specialised domains such as medical imaging, where domain-specific solutions or alternative knowledge-transfer approaches are typically required. Recent studies have observed that general-purpose models can exhibit similar latent spaces when processing semantically related data, although this alignment does not emerge automatically. Building on this insight, prior work has shown that applying a simple transformation (at most affine), estimated from a subset of semantically corresponding samples known as anchors, enables model stitching across diverse training paradigms, architectures, and modalities. In this paper, we explore how semantic alignment (estimating transformations between anchors) can bridge general-purpose AI with specialised medical knowledge. Using multiple public chest X-ray datasets, we demonstrate that model stitching across architectures allows general models to integrate domain-specific knowledge without additional training, leading to improved performance on medical tasks. Furthermore, we introduce a novel zero-shot classification approach for unimodal vision encoders that leverages semantic alignment across modalities. Our results show that our method not only outperforms general multimodal models but also approaches the performance of fully trained, medical-specific multimodal solutions.
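The core mechanism described above (estimating an at-most-affine transformation from anchor pairs to map one encoder's latent space onto another's) can be sketched in a few lines. The following is a minimal illustration using ordinary least squares, not the paper's actual implementation; the function and variable names are hypothetical, and a real pipeline would use embeddings produced by the two encoders on shared anchor images rather than synthetic data.

```python
import numpy as np

def fit_affine(anchors_src, anchors_tgt):
    """Least-squares affine map (W, b) such that anchors_src @ W + b ~= anchors_tgt.

    anchors_src: (n, d_src) embeddings of the anchor samples from encoder A.
    anchors_tgt: (n, d_tgt) embeddings of the same samples from encoder B.
    """
    n = anchors_src.shape[0]
    # Augment with a constant column so the bias b is fitted jointly with W.
    A = np.hstack([anchors_src, np.ones((n, 1))])
    sol, *_ = np.linalg.lstsq(A, anchors_tgt, rcond=None)
    return sol[:-1], sol[-1]  # W: (d_src, d_tgt), b: (d_tgt,)

# Toy check: recover a known affine relation from 50 noiseless anchor pairs.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 8))        # stand-in for encoder-A anchor embeddings
W_true = rng.normal(size=(8, 8))
b_true = rng.normal(size=8)
Y = X @ W_true + b_true             # corresponding encoder-B embeddings
W, b = fit_affine(X, Y)
```

Once fitted, `W` and `b` let features from a general-purpose encoder be projected into a medical model's latent space (or vice versa), so the downstream head of one model can be reused on top of the other without retraining.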