Foundation models (FMs) have recently redefined the state of the art in histopathology by providing robust representations for whole-slide image (WSI) analysis. However, selecting the optimal FM for a specific clinical cohort currently requires multiple preprocessing steps, followed by computationally expensive feature extraction and the training of a Multiple Instance Learning (MIL) aggregator for every model. In this work, we investigate whether efficient tile-level linear probing can serve as a reliable proxy for slide-level performance, reducing the need to run the full slide-level pipeline for every candidate encoder. We benchmark 19 state-of-the-art FMs on 42 slide-level and 16 tile-level tasks, comparing tile-level probing metrics against slide-level outcomes obtained with attention-based MIL (ABMIL) and mean-pooling aggregation. We observe a high correlation between tile-level and slide-level performance across varying task difficulties, indicating that encoder representation quality is the primary determinant of slide-level performance. Sensitivity analyses show that transferability is stable across models and depends more on cohort size and the number of tiles per slide than on average task difficulty. We also measure how well tile-level and slide-level tasks agree on the best-performing models, and find that tile-level benchmarks reliably shortlist strong candidates. Overall, our study indicates that tile-level benchmarking provides an efficient and practical first step for narrowing down candidate models, while slide-level evaluation remains essential for final validation on clinical tasks.
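The comparison between tile-level and slide-level results can be illustrated with a minimal sketch. It assumes one scalar score per encoder at each level, and it uses Spearman rank correlation and top-k overlap as example agreement measures; the abstract does not specify which correlation or agreement statistics are used, so both are assumptions. The encoder names (`enc_a` to `enc_e`) and all scores are hypothetical placeholders, not results from the study.

```python
from statistics import mean


def average_ranks(values):
    """Rank values in ascending order (1-based); ties share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (non-constant inputs)."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def top_k_overlap(tile_scores, slide_scores, k):
    """Fraction of the slide-level top-k encoders also in the tile-level top-k."""
    def top(scores):
        return set(sorted(scores, key=scores.get, reverse=True)[:k])
    return len(top(tile_scores) & top(slide_scores)) / k


# Hypothetical scores for illustration only (not taken from the paper).
tile = {"enc_a": 0.91, "enc_b": 0.88, "enc_c": 0.84, "enc_d": 0.80, "enc_e": 0.79}
slide = {"enc_a": 0.86, "enc_b": 0.87, "enc_c": 0.81, "enc_d": 0.74, "enc_e": 0.76}

models = sorted(tile)
rho = spearman([tile[m] for m in models], [slide[m] for m in models])
overlap = top_k_overlap(tile, slide, k=3)
print(f"Spearman rho = {rho:.2f}, top-3 overlap = {overlap:.2f}")
```

In this toy setting, a high rank correlation together with a high top-k overlap would support using the tile-level scores to shortlist encoders before running the slide-level pipeline on the shortlist only.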