This position paper argues that current benchmarking practice in 12-lead ECG representation learning must be fixed so that progress is reliable and aligned with clinically meaningful objectives. The field has largely converged on three public multi-label benchmarks (PTB-XL, CPSC2018, CSN) dominated by arrhythmia and waveform-morphology labels, even though the ECG is known to encode substantially broader clinical information. We argue that downstream evaluation should expand to clinically relevant targets such as structural heart disease and patient-level forecasting, along with other evolving ECG-related endpoints. We then outline evaluation best practices for multi-label, imbalanced settings and show that applying them alters the literature's current conclusion about which representations perform best. We further demonstrate the surprising result that a randomly initialized encoder with linear evaluation matches state-of-the-art pre-training on many tasks, which motivates using a random encoder as a baseline model. We substantiate these observations with an empirical evaluation of three representative ECG pre-training approaches across six evaluation settings: the three standard benchmarks, a structural disease dataset, hemodynamic inference, and patient forecasting.
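To make the random-encoder baseline concrete, the following is a minimal sketch of a frozen, randomly initialized encoder followed by linear evaluation. It is not the paper's architecture or protocol: the single convolutional layer, the function names, the hyperparameters, and the choice of macro-averaged AUROC (skipping labels that lack both classes in the evaluation split) are all illustrative assumptions.

```python
import numpy as np


def random_conv_encoder(x, n_filters=32, kernel_size=16, stride=8, seed=0):
    """Frozen random 1-D conv layer + ReLU + global average pooling.

    x has shape (n_records, n_leads, n_samples); returns (n_records, n_filters).
    The weights are drawn once from a fixed seed and never trained.
    """
    rng = np.random.default_rng(seed)
    n_leads = x.shape[1]
    scale = np.sqrt(2.0 / (n_leads * kernel_size))  # He-style scaling (assumption)
    w = rng.normal(0.0, scale, size=(n_filters, n_leads, kernel_size))
    # windows: (n_records, n_leads, n_windows, kernel_size)
    windows = np.lib.stride_tricks.sliding_window_view(x, kernel_size, axis=2)
    windows = windows[:, :, ::stride]
    feats = np.einsum("blwk,flk->bfw", windows, w)
    return np.maximum(feats, 0.0).mean(axis=2)


def fit_linear_probe(z, y, lr=0.1, n_steps=500, l2=1e-3):
    """One-vs-rest logistic regression on frozen features.

    z: (n, d) features, y: (n, n_labels) binary labels.
    Returns a function mapping new features to per-label logits.
    The step size is small enough for a few dozen standardized features.
    """
    mu, sd = z.mean(axis=0), z.std(axis=0) + 1e-8
    zs = (z - mu) / sd
    w = np.zeros((z.shape[1], y.shape[1]))
    b = np.zeros(y.shape[1])
    for _ in range(n_steps):
        p = 1.0 / (1.0 + np.exp(-(zs @ w + b)))
        w -= lr * (zs.T @ (p - y) / len(y) + l2 * w)
        b -= lr * (p - y).mean(axis=0)
    return lambda z_new: ((z_new - mu) / sd) @ w + b


def auroc(y_true, score):
    """Rank-based (Mann-Whitney) AUROC; assumes scores contain no ties."""
    order = np.argsort(score)
    ranks = np.empty(len(score))
    ranks[order] = np.arange(1, len(score) + 1)
    n_pos = (y_true == 1).sum()
    n_neg = len(y_true) - n_pos
    return (ranks[y_true == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def macro_auroc(y_true, scores):
    """Mean AUROC over labels that have both classes in the evaluation split."""
    vals = [
        auroc(y_true[:, j], scores[:, j])
        for j in range(y_true.shape[1])
        if y_true[:, j].min() != y_true[:, j].max()
    ]
    return float(np.mean(vals))
```

In use, the same `fit_linear_probe` and `macro_auroc` calls would be applied to features from a pre-trained encoder and to features from `random_conv_encoder`, so that any reported gain from pre-training is measured against the random baseline under an identical evaluation protocol.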