Vision-Language Models (VLMs) have demonstrated remarkable generalization capabilities, yet their application to medical ultrasound remains constrained by the significant domain shift between natural images and sonographic data. The unique physics of ultrasound imaging, manifesting as speckle noise, acoustic shadowing, and variable artifacts, often degrades the performance of off-the-shelf foundation models applied without adaptation. To address this, we propose a novel Hybrid-tuning (HT) strategy for the efficient adaptation of CLIP-based models to ultrasound analysis. Our method introduces a lightweight adapter module integrated into the frozen visual backbone, featuring frequency-domain filtering to suppress periodic artifacts and dynamic noise estimation to calibrate feature representations. Furthermore, we design dedicated segmentation and classification heads that employ multi-scale feature aggregation to maximize the utility of pre-trained semantic priors. Extensive evaluations across six multi-center datasets (covering lymph nodes, breast, thyroid, and prostate) show that our HT-enhanced models significantly outperform existing state-of-the-art methods, including BiomedCLIP and standard LoRA fine-tuning. These results highlight the superior data efficiency and robustness of our approach, paving the way for practical foundation-model intelligence in automated ultrasound diagnosis. The source code is available at https://github.com/jinggqu/NextGen-UIA.
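To make the adapter idea described above concrete, the following is a minimal PyTorch sketch of how a frequency-domain filter and a dynamic noise-calibration branch could be attached to patch tokens from a frozen CLIP visual backbone. All class names, shapes, and hyperparameters (FrequencyDomainFilter, NoiseCalibratedAdapter, grid, reduction) are hypothetical illustrations of the general technique and are not taken from the released code.

```python
# Hypothetical sketch of an ultrasound adapter in PyTorch; module and
# parameter names are illustrative, not taken from the paper or its repository.
import torch
import torch.nn as nn


class FrequencyDomainFilter(nn.Module):
    """Learnable gain mask applied in the 2D Fourier domain to damp
    periodic artifacts (a generic stand-in for the filtering idea)."""

    def __init__(self, height: int, width: int):
        super().__init__()
        # One learnable gain per frequency bin, initialized to pass-through.
        self.mask = nn.Parameter(torch.ones(1, 1, height, width))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) feature map reshaped from ViT patch tokens (assumption).
        spec = torch.fft.fftshift(torch.fft.fft2(x, norm="ortho"))
        spec = spec * self.mask                       # attenuate selected bands
        out = torch.fft.ifft2(torch.fft.ifftshift(spec), norm="ortho")
        return out.real


class NoiseCalibratedAdapter(nn.Module):
    """Bottleneck adapter whose features are rescaled by a per-channel noise
    estimate, loosely following the 'dynamic noise estimation' description."""

    def __init__(self, dim: int, grid: int = 14, reduction: int = 4):
        super().__init__()
        self.grid = grid
        self.freq_filter = FrequencyDomainFilter(grid, grid)
        self.down = nn.Linear(dim, dim // reduction)
        self.up = nn.Linear(dim // reduction, dim)
        self.act = nn.GELU()
        # Noise head: global pooling -> per-channel calibration factor in (0, 1).
        self.noise_head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Conv2d(dim, dim, 1), nn.Sigmoid()
        )

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (B, N, C) patch tokens from a frozen CLIP visual backbone.
        b, n, c = tokens.shape
        fmap = tokens.transpose(1, 2).reshape(b, c, self.grid, self.grid)
        fmap = self.freq_filter(fmap)                 # frequency-domain filtering
        fmap = fmap * self.noise_head(fmap)           # noise-aware calibration
        filtered = fmap.flatten(2).transpose(1, 2)    # back to (B, N, C)
        # Residual bottleneck adapter on the calibrated tokens.
        return tokens + self.up(self.act(self.down(filtered)))


if __name__ == "__main__":
    adapter = NoiseCalibratedAdapter(dim=768, grid=14)
    dummy = torch.randn(2, 14 * 14, 768)              # CLS token omitted for brevity
    print(adapter(dummy).shape)                       # torch.Size([2, 196, 768])
```

In a sketch like this, only the adapter (and the task heads) would be trainable while the CLIP backbone stays frozen, which is what gives parameter-efficient adaptation of this kind its data efficiency.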