Vision-Language Models as Zero-Annotation Oracles in Histopathology

Foreground segmentation is the critical first step of every computational pathology pipeline, yet existing methods rely on hand-tuned heuristics or supervised models that overfit to narrow stain and scanner distributions and fail silently on specialised stains such as Jones silver or Elastica van Gieson (EVG). We propose a coarse-to-fine approach that recasts foreground segmentation as a visual perception task and uses general-purpose vision-language models (VLMs) as zero-annotation oracles. Our key insight is that tissue-versus-background discrimination is a natural-image recognition problem, not a histopathological one, so VLMs trained on internet-scale corpora generalise where domain-specific models cannot. We introduce Leica-75, a benchmark of 75 renal transplant whole-slide images spanning three stain families. On Leica-75, our method achieves the highest segmentation quality on out-of-distribution stains (Dice 0.858 ± 0.027 on Jones, 0.853 ± 0.041 on EVG) with 7x lower cross-stain variance than the best supervised baseline, while remaining competitive on in-distribution H&E. Few-shot prompting with automatically curated exemplars (Auto-context) rescues hard cases on Stress-32 (n=32), a curated stress-test subset, raising Dice from 0.470 to 0.819 for the 2B model. VLM-based annotation review agrees closely with human expert consensus on blur detection (kappa = 0.989) and exceeds human grading accuracy on segmentation mask review (mean precision/recall grading accuracy 0.708 vs. 0.646). The resulting pseudo-labels are used to distil lightweight student models that match the teacher model's performance at a fraction of the cost. Our framework provides a principled, scalable solution to a persistent infrastructure bottleneck in digital pathology.
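For readers unfamiliar with the metric reported above, the Dice coefficient between a predicted and a reference tissue mask can be sketched as follows. This is a minimal illustration and not the evaluation code behind the reported numbers: the function name `dice`, the nested-list mask representation, and the convention of returning 1.0 when both masks are empty are all assumptions made for the example.

```python
from typing import Sequence


def dice(pred: Sequence[Sequence[int]], truth: Sequence[Sequence[int]]) -> float:
    """Dice = 2|A ∩ B| / (|A| + |B|) for binary masks (non-zero = tissue)."""
    inter = pred_sum = truth_sum = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            p, t = bool(p), bool(t)
            inter += p and t      # pixel is foreground in both masks
            pred_sum += p         # foreground pixels in the prediction
            truth_sum += t        # foreground pixels in the reference
    denom = pred_sum + truth_sum
    # Assumed convention: two empty masks count as perfect agreement.
    return 1.0 if denom == 0 else 2.0 * inter / denom


# Toy 2x3 masks: 2 overlapping pixels, 3 foreground pixels in each mask.
pred_mask = [[1, 1, 0],
             [0, 1, 0]]
truth_mask = [[1, 0, 0],
              [0, 1, 1]]
score = dice(pred_mask, truth_mask)  # 2 * 2 / (3 + 3)
```

A score of 1.0 means the two masks coincide exactly and 0.0 means they share no foreground pixels, so the reported values near 0.85 correspond to substantial but imperfect overlap.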
