The electrocardiogram (ECG) is a cost-effective, highly accessible, and widely used diagnostic tool. With the advent of Foundation Models (FMs), the field of AI-assisted ECG interpretation has begun to evolve, because FMs enable model reuse across different tasks through the embeddings they produce. However, to employ FMs responsibly, it is crucial to rigorously assess to what extent these embeddings generalize, particularly in error-sensitive domains such as healthcare. Although prior works have already addressed the problem of benchmarking ECG-expert FMs, they focus predominantly on downstream performance rather than on the structure of the learned representations. To fill this gap, this study aims to establish an in-depth, comprehensive benchmarking framework for FMs, with a specific focus on ECG-expert ones. To this end, we introduce a benchmark methodology that complements performance-based evaluation with representation-level analysis, leveraging SHAP and UMAP techniques. Furthermore, we use this methodology to carry out an extensive evaluation of several ECG-expert FMs, pretrained via state-of-the-art techniques, over different cross-continental datasets and data-availability settings, including data-scarce ones, a fairly common situation in real-world medical scenarios. Experimental results show that our benchmarking protocol provides rich insight into the patterns embedded by ECG-expert FMs, enabling a deeper understanding of their representational structure and generalizability.