Medical data poses a daunting challenge for AI algorithms: it exists in many different modalities, experiences frequent distribution shifts, and suffers from a scarcity of examples and labels. Recent advances, including transformers and self-supervised learning, promise a more universal approach that can be applied flexibly across these diverse conditions. To measure and drive progress in this direction, we present BenchMD: a benchmark that tests how well unified, modality-agnostic methods, including architectures and training techniques (e.g., self-supervised learning, ImageNet pretraining), perform on a diverse array of clinically relevant medical tasks. BenchMD combines 19 publicly available datasets for 7 medical modalities, including 1D sensor data, 2D images, and 3D volumetric scans. Our benchmark reflects real-world data constraints by evaluating methods across a range of dataset sizes, including challenging few-shot settings that incentivize the use of pretraining. Finally, we evaluate performance on out-of-distribution data collected at hospitals other than those that supplied the training data, representing naturally occurring distribution shifts that frequently degrade the performance of medical AI models. Our baseline results demonstrate that no unified learning technique achieves strong performance across all modalities, leaving ample room for improvement on the benchmark. Code is released at https://github.com/rajpurkarlab/BenchMD.