Large-scale visual-language pre-trained models (VLPMs) have demonstrated excellent performance on downstream object detection in natural scenes. However, zero-shot nuclei detection on H&E images with VLPMs remains underexplored, and the large gap between medical images and the web-sourced text-image pairs used for pre-training makes the task challenging. In this paper, we explore the potential of an object-level VLPM, the Grounded Language-Image Pre-training (GLIP) model, for zero-shot nuclei detection. Concretely, we devise an automatic prompt design pipeline based on the association binding trait of VLPMs and the image-to-text VLPM BLIP, which avoids empirical manual prompt engineering. We further establish a self-training framework that uses the automatically designed prompts to generate preliminary GLIP detections as pseudo labels and refines the predicted boxes iteratively. Our method achieves remarkable performance for label-free nuclei detection, surpassing the other compared methods. More importantly, our work demonstrates that a VLPM pre-trained on natural image-text pairs also exhibits considerable potential for downstream tasks in the medical field. Code will be released at https://github.com/wuyongjianCODE/VLPMNuD.
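The self-training idea described above can be illustrated with a minimal, generic sketch. It is not the released implementation: the abstract does not specify how boxes are filtered or refined, so the confidence threshold, the `fit` callback, and every name below (`Box`, `filter_pseudo_labels`, `self_train`) are hypothetical and chosen only for illustration. In the setting of this paper, `initial_detector` would stand in for GLIP queried with the automatically designed prompts.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence


@dataclass(frozen=True)
class Box:
    """A detected box with a confidence score (hypothetical structure)."""
    x1: float
    y1: float
    x2: float
    y2: float
    score: float


# A detector maps an image identifier to a list of predicted boxes.
Detector = Callable[[str], List[Box]]
# A fit function trains a new detector from pseudo labels (assumed interface).
FitFn = Callable[[Dict[str, List[Box]]], Detector]


def filter_pseudo_labels(boxes: Sequence[Box], min_score: float) -> List[Box]:
    """Keep only predictions confident enough to serve as pseudo labels."""
    return [b for b in boxes if b.score >= min_score]


def self_train(
    initial_detector: Detector,
    fit: FitFn,
    image_ids: Sequence[str],
    rounds: int = 3,
    min_score: float = 0.5,
) -> Detector:
    """Generic self-training loop: predict, filter, retrain, repeat."""
    detector = initial_detector
    for _ in range(rounds):
        # Use the current detector's confident outputs as pseudo labels.
        pseudo = {
            img: filter_pseudo_labels(detector(img), min_score)
            for img in image_ids
        }
        # Train the next detector on those pseudo labels.
        detector = fit(pseudo)
    return detector
```

Each round replaces the detector with one trained on the previous round's filtered predictions, which is the general sense in which predicted boxes are refined iteratively; the number of rounds and the score threshold are assumed knobs, not values taken from the paper.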