Progress in a research field can be hard to assess, in particular when many concurrent methods are proposed in a short period of time. This is the case in digital pathology, where many foundation models have recently been released to serve as feature extractors for tile-level images, and are used in a variety of downstream tasks covering both tile- and slide-level problems. Benchmarking available methods then becomes paramount to get a clearer view of the research landscape. In particular, in critical domains such as healthcare, a benchmark should not only evaluate downstream performance, but also provide insights into the main differences between methods and, importantly, further consider uncertainty and robustness to ensure reliable usage of the proposed models. For these reasons, we introduce THUNDER, a tile-level benchmark for digital pathology foundation models, allowing for efficient comparison of many models on diverse datasets with a series of downstream tasks, studying their feature spaces and assessing the robustness and uncertainty of predictions informed by their embeddings. THUNDER is a fast, easy-to-use, dynamic benchmark that already supports a large variety of state-of-the-art foundation models, as well as local user-defined models, for direct tile-based comparison. In this paper, we provide a comprehensive comparison of 23 foundation models on 16 different datasets covering diverse tasks, feature analysis, and robustness. The code for THUNDER is publicly available at https://github.com/MICS-Lab/thunder.