GOLDMARK: Governed Outcome-Linked Diagnostic Model Assessment Reference Kit

Chad Vanderbilt, Gabriele Campanella, Siddharth Singi, Swaraj Nanda, Jie-Fu Chen, Ali Kamali, Amir Momeni Boroujeni, David Kim, Mohamed Yakoub, Jamal Benhamida, Meera Hameed, Neeraj Kumar, Gregory Goldgof

Computational biomarkers (CBs) are histopathology-derived patterns extracted from hematoxylin and eosin (H&E) whole-slide images (WSIs) using artificial intelligence (AI) to predict therapeutic response or prognosis. Recently, slide-level multiple-instance learning (MIL) with pathology foundation models (PFMs) has become the standard baseline for CB development. While these methods have improved predictive performance, computational pathology still lacks the standardized intermediate data formats, provenance tracking, checkpointing conventions, and reproducible evaluation metrics required for clinical-grade deployment. We introduce GOLDMARK (https://artificialintelligencepathology.org), a standardized benchmarking framework built on a curated TCGA cohort with clinically actionable OncoKB level 1-3 biomarker labels. GOLDMARK releases structured intermediate representations, including tile coordinate maps, per-slide feature embeddings from canonical PFMs, quality-control metadata, predefined patient-level splits, trained slide-level models, and evaluation outputs. Models are trained on TCGA and evaluated on an independent MSKCC cohort with reciprocal testing. Across 33 tumor-biomarker tasks, mean AUROC was 0.689 (TCGA) and 0.630 (MSKCC). Restricting to the eight highest-performing tasks yielded mean AUROCs of 0.831 and 0.801, respectively. These tasks correspond to established morphologic-genomic associations (e.g., LGG IDH1, COAD MSI/BRAF, THCA BRAF/NRAS, BLCA FGFR3, UCEC PTEN) and showed the most stable cross-site performance. Differences between canonical encoders were modest relative to task-specific variability. GOLDMARK establishes a shared experimental substrate for computational pathology, enabling reproducible benchmarking and direct comparison of methods across datasets and models.
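
The summary statistics in the abstract (a mean AUROC over all tasks in each cohort, then the same mean restricted to the best-performing tasks) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the function names `auroc` and `summarize`, the task names, and the toy labels and scores are hypothetical and are not GOLDMARK code or data. The abstract also does not state which cohort's AUROC is used to rank tasks; the sketch ranks by the internal (training-cohort) AUROC.

```python
from statistics import mean


def auroc(labels, scores):
    """AUROC as the Mann-Whitney statistic: the fraction of
    (positive, negative) pairs ranked correctly, ties counting half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def summarize(per_task, k):
    """per_task maps task name -> (internal AUROC, external AUROC).

    Returns the mean over all tasks and over the k tasks with the
    highest internal AUROC (the ranking rule is an assumption).
    """
    ranked = sorted(per_task, key=lambda t: per_task[t][0], reverse=True)
    top = ranked[:k]

    def cohort_means(tasks):
        return (mean(per_task[t][0] for t in tasks),
                mean(per_task[t][1] for t in tasks))

    return {"all": cohort_means(ranked),
            "top_k": cohort_means(top),
            "top_tasks": top}


# Toy (labels, scores) pairs, invented for this example only.
toy_predictions = {
    "task_a": {"internal": ([1, 1, 0, 0], [0.9, 0.8, 0.2, 0.1]),
               "external": ([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])},
    "task_b": {"internal": ([1, 0, 1, 0], [0.6, 0.7, 0.8, 0.2]),
               "external": ([1, 0, 1, 0], [0.3, 0.6, 0.7, 0.4])},
}

per_task = {
    name: (auroc(*cohorts["internal"]), auroc(*cohorts["external"]))
    for name, cohorts in toy_predictions.items()
}

summary = summarize(per_task, k=1)
print(summary)
```

With real benchmark outputs, `per_task` would hold one entry per tumor-biomarker task and `k` would be 8. Note that choosing the top tasks by the same internal AUROC that is then averaged biases the internal mean upward, which is one reason to report the external cohort alongside it.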
