Benchmarking Pathology Foundation Models for Breast Cancer Survival Prediction

Pathology foundation models (PFMs) have recently emerged as powerful pretrained encoders for computational pathology, enabling transfer learning across a wide range of downstream tasks. However, systematic comparisons of these models for clinically meaningful prediction problems remain limited, especially in the context of survival prediction under external validation. In this study, we benchmark widely used and recently proposed PFMs for breast cancer survival prediction from whole-slide histopathology images. Using a standardized pipeline based on patch-level feature extraction and a unified survival modeling framework, we evaluate model representations across three independent clinical cohorts comprising more than 5,400 patients with long-term follow-up. Models are trained on one cohort and evaluated on two independent external cohorts, enabling a rigorous assessment of cross-dataset generalization. Overall, H-optimus-1 achieves the strongest survival prediction performance. More broadly, we observe consistent generational improvements across model families, with second-generation PFMs outperforming their first-generation counterparts. However, absolute performance differences between many recent PFMs remain modest, suggesting diminishing returns from further scaling of pretraining data or model size alone. Notably, the compact distilled model H0-mini slightly outperforms its larger teacher model H-optimus-0, despite using fewer than 8% of the parameters and enabling significantly faster feature extraction. Together, these results provide the first large-scale, externally validated benchmark of PFMs for breast cancer survival prediction, and offer practical guidance for efficient deployment of PFMs in clinical workflows.
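As a rough illustration of the kind of pipeline described above, the sketch below aggregates patch-level features into a single slide-level embedding and scores predicted risks on a held-out cohort. The abstract specifies neither the aggregation method nor the evaluation metric, so mean pooling and Harrell's concordance index are assumptions made here for illustration only, and the function names `mean_pool` and `concordance_index` are hypothetical.

```python
import numpy as np


def mean_pool(patch_features):
    """Average patch embeddings (n_patches x dim) into one slide-level vector.

    Assumption: mean pooling stands in for whatever aggregation the study uses.
    """
    return np.asarray(patch_features, dtype=float).mean(axis=0)


def concordance_index(times, events, risks):
    """Harrell's C-index for right-censored survival data.

    times  : follow-up time per patient
    events : 1 if the event was observed, 0 if censored
    risks  : predicted risk score (higher = shorter expected survival)
    """
    times = np.asarray(times, dtype=float)
    events = np.asarray(events, dtype=bool)
    risks = np.asarray(risks, dtype=float)

    concordant = 0.0
    comparable = 0
    for i in range(len(times)):
        if not events[i]:
            continue  # a pair is only comparable if the earlier time is an observed event
        for j in range(len(times)):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5  # ties in predicted risk count as half
    if comparable == 0:
        raise ValueError("no comparable pairs in this cohort")
    return concordant / comparable


# Toy external-cohort example with made-up numbers (not data from the study).
slide_embedding = mean_pool(np.array([[1.0, 2.0], [3.0, 4.0]]))
c_index = concordance_index(
    times=[1, 2, 3, 4], events=[1, 1, 0, 1], risks=[4.0, 3.0, 2.0, 1.0]
)
```

In a cross-cohort setting like the one described, the survival model would be fitted on the training cohort only, and a score of this kind would then be computed separately on each external cohort without refitting.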
