A Comprehensive Benchmark of Histopathology Foundation Models for Kidney Digital Pathology Images

Harishwar Reddy Kasireddy, Patricio S. La Rosa, Akshita Gupta, Anindya S. Paul, Jamie L. Fermin, William L. Clapp, Meryl A. Waldman, Tarek M. El-Ashkar, Sanjay Jain, Luis Rodrigues, Kuang Yu Jen, Avi Z. Rosenberg, Michael T. Eadon, Jeffrey B. Hodgin, Pinaki Sarder

From arXiv: 31 pages, 14 tables, 12 figures. Co-correspondence to [email protected] and [email protected]

Histopathology foundation models (HFMs), pretrained on large-scale cancer datasets, have advanced computational pathology. However, their applicability to non-cancerous chronic kidney disease remains underexplored, despite the coexistence of renal pathology with malignancies such as renal cell and urothelial carcinoma. We systematically evaluate 11 publicly available HFMs across 11 kidney-specific downstream tasks spanning multiple stains (PAS, H&E, PASM, and IHC), spatial scales (tile-level and slide-level), task types (classification, regression, and copy detection), and clinical objectives (detection, diagnosis, and prognosis). Tile-level performance is assessed using repeated stratified group cross-validation, while slide-level tasks are evaluated using repeated nested stratified cross-validation. Statistical significance is examined using the Friedman test, followed by pairwise Wilcoxon signed-rank tests with Holm-Bonferroni correction and compact letter display visualization. To promote reproducibility, we release an open-source Python package, kidney-hfm-eval (https://pypi.org/project/kidney-hfm-eval/), that reproduces the evaluation pipelines. Results show moderate to strong performance on tasks driven by coarse meso-scale renal morphology, including diagnostic classification and detection of prominent structural alterations. In contrast, performance consistently declines for tasks requiring fine-grained microstructural discrimination, complex biological phenotypes, or slide-level prognostic inference, largely independent of stain type. Overall, current HFMs appear to encode predominantly static meso-scale representations and may have limited capacity to capture subtle renal pathology or prognosis-related signals. Our results highlight the need for kidney-specific, multi-stain, and multimodal foundation models to support clinically reliable decision-making in nephrology.
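
As an illustration of the multiple-comparison step mentioned in the abstract, the following is a minimal sketch of the Holm-Bonferroni step-down adjustment applied to a family of pairwise p-values. It is not taken from the paper or from the kidney-hfm-eval package: the function name `holm_bonferroni`, the model labels, and the raw p-values are all hypothetical and chosen only for illustration. In practice the raw p-values would come from pairwise Wilcoxon signed-rank tests run after a significant Friedman test.

```python
from typing import Dict, Tuple

Pair = Tuple[str, str]


def holm_bonferroni(p_values: Dict[Pair, float]) -> Dict[Pair, float]:
    """Return Holm-Bonferroni adjusted p-values for a family of tests."""
    m = len(p_values)
    # Order the comparisons from the smallest to the largest raw p-value.
    ordered = sorted(p_values.items(), key=lambda item: item[1])
    adjusted: Dict[Pair, float] = {}
    running_max = 0.0
    for rank, (pair, p) in enumerate(ordered):
        # Step-down multiplier: m for the smallest p-value, then m - 1, ...
        candidate = min(1.0, (m - rank) * p)
        # Keep adjusted values non-decreasing along the sorted order.
        running_max = max(running_max, candidate)
        adjusted[pair] = running_max
    return adjusted


# Hypothetical raw p-values (illustrative only, not results from the paper).
raw = {
    ("model_A", "model_B"): 0.01,
    ("model_A", "model_C"): 0.04,
    ("model_B", "model_C"): 0.03,
}
adjusted = holm_bonferroni(raw)
significant = {pair: p_adj < 0.05 for pair, p_adj in adjusted.items()}
```

With these made-up inputs, only the model_A versus model_B comparison stays below 0.05 after adjustment. Pairs that remain non-significant would share a letter in a compact letter display.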
