Evaluating Computational Pathology Foundation Models for Prostate Cancer Grading under Distribution Shifts

Pathology foundation models (PFMs) have emerged as powerful pretrained encoders for computational pathology, but their robustness under clinically relevant distribution shifts remains insufficiently understood. We benchmark the robustness of recent PFMs in the setting of prostate cancer grading from whole-slide images (WSIs). Using the PANDA dataset, we evaluate PFMs as frozen patch-level feature extractors within weakly supervised slide-level grading models, and assess robustness to two important forms of distribution shift: shifts in WSI image appearance across collection sites, and shifts in the label distribution over cancer grade groups. Across in-distribution settings, PFMs consistently achieve strong performance and clearly outperform a natural-image baseline. Under cross-site transfer from Radboud to Karolinska, however, performance drops substantially for all models, showing that large-scale pretraining alone does not guarantee robust downstream generalization. In contrast, PFMs are less sensitive to label-distribution shift, indicating that visually grounded domain shift is the dominant challenge. Representation analysis further supports these findings by revealing persistent domain separation between sites across all PFMs. While grade-related structure is present, it is comparatively weak, indicating that domain-related variation dominates in the learned feature space. Together, these results provide a comprehensive benchmark of PFMs under distribution shift and highlight an important practical message: although PFMs provide strong representations, generalizability remains constrained by the quality and diversity of the data used to train downstream prediction models.
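To make the evaluation setup concrete, the following minimal sketch shows how frozen patch-level features might be aggregated into a single slide-level representation for weakly supervised grading. The abstract does not specify the aggregation mechanism, so the simplified linear attention pooling used here is an assumption. The names `attention_pool` and `attn_weights` are hypothetical and are not taken from the paper.

```python
import math
from typing import List, Sequence, Tuple


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scalar scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(
    patch_features: Sequence[Sequence[float]],
    attn_weights: Sequence[float],
) -> Tuple[List[float], List[float]]:
    """Pool N patch embeddings (each of dimension D) into one slide embedding.

    patch_features: outputs of a frozen encoder, one vector per patch.
    attn_weights:   hypothetical learned vector of dimension D that scores
                    each patch (an assumption; the paper's aggregator may differ).
    Returns the slide-level embedding and the per-patch attention weights.
    """
    # One scalar relevance score per patch (dot product with attn_weights).
    scores = [
        sum(w * x for w, x in zip(attn_weights, feat)) for feat in patch_features
    ]
    # Normalize scores so the attention weights sum to 1 across patches.
    alphas = softmax(scores)
    # Attention-weighted average of the patch embeddings.
    dim = len(patch_features[0])
    slide = [
        sum(a * feat[d] for a, feat in zip(alphas, patch_features))
        for d in range(dim)
    ]
    return slide, alphas


# Toy example: two patches with 2-dimensional features and zero scoring weights,
# so both patches receive equal attention.
slide_embedding, attention = attention_pool([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
```

In a setup of this kind, only the aggregator and the grade classifier on top of the slide embedding would be trained on slide-level labels, while the encoder that produces `patch_features` stays fixed.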
