Although vision foundation models (VFMs) are increasingly reused for biomedical image analysis, it remains unclear whether the latent representations they provide are general enough to support effective transfer and reuse across heterogeneous microscopy image datasets. Here, we study this question for the problem of mitochondria segmentation in electron microscopy (EM) images, using two popular public EM datasets (Lucchi++ and VNC) and three recent, representative VFMs (DINOv2, DINOv3, and OpenCLIP). We evaluate two practical model adaptation regimes: a frozen-backbone setting in which only a lightweight segmentation head is trained on top of the VFM, and parameter-efficient fine-tuning (PEFT) via Low-Rank Adaptation (LoRA), in which the VFM itself is adapted in a targeted manner to a specific dataset. Across all backbones, we observe that training on a single EM dataset yields good segmentation performance (quantified as foreground Intersection-over-Union), and that LoRA consistently improves in-domain performance. In contrast, training jointly on multiple EM datasets leads to severe performance degradation for all models considered, with only marginal gains from PEFT. Exploration of the latent representation space through several techniques (PCA, Fréchet DINOv2 distance, and linear probes) reveals a pronounced and persistent domain mismatch between the two considered EM datasets in spite of their visual similarity, which is consistent with the observed failure of joint training. These results suggest that, while VFMs can deliver competitive results for EM segmentation within a single domain under lightweight adaptation, current PEFT strategies are insufficient to obtain a single robust model across heterogeneous EM datasets without additional domain-alignment mechanisms.
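The foreground Intersection-over-Union used above can be stated concretely. The following is a minimal NumPy sketch (not the paper's evaluation code) for binary mitochondria masks, where the foreground class is 1:

```python
import numpy as np

def foreground_iou(pred: np.ndarray, target: np.ndarray) -> float:
    """Foreground IoU between two binary masks: |pred ∩ target| / |pred ∪ target|."""
    pred = pred.astype(bool)
    target = target.astype(bool)
    intersection = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    # Convention: if neither mask contains foreground, score a perfect match.
    return float(intersection / union) if union > 0 else 1.0

pred = np.array([[1, 1, 0],
                 [0, 1, 0]])
target = np.array([[1, 0, 0],
                   [0, 1, 1]])
print(foreground_iou(pred, target))  # 2 overlapping / 4 in union = 0.5
```

Per-image scores like this are typically averaged over the test set to produce the reported figure.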
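The Fréchet DINOv2 distance mentioned above compares backbone feature distributions under a Gaussian assumption, in the same way FID compares Inception features. A minimal sketch follows; the random arrays stand in for DINOv2 patch embeddings of the two datasets, which in the actual analysis would come from the frozen encoder:

```python
import numpy as np
from scipy import linalg

def frechet_distance(feats_a: np.ndarray, feats_b: np.ndarray) -> float:
    """Fréchet distance between Gaussians fitted to two (N, D) feature sets:
    ||mu_a - mu_b||^2 + Tr(C_a + C_b - 2 (C_a C_b)^{1/2})."""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    covmean, _ = linalg.sqrtm(cov_a @ cov_b, disp=False)
    covmean = covmean.real  # drop tiny imaginary parts from numerical noise
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a + cov_b - 2.0 * covmean))

rng = np.random.default_rng(0)
feats_same = rng.normal(size=(500, 8))            # stand-in for dataset A features
feats_shifted = rng.normal(loc=2.0, size=(500, 8))  # stand-in for a mismatched domain

print(frechet_distance(feats_same, feats_same))     # near zero: identical distributions
print(frechet_distance(feats_same, feats_shifted))  # large: pronounced domain mismatch
```

A large distance between the Lucchi++ and VNC feature clouds, despite their visual similarity, is the kind of signal this probe surfaces.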