From PhysioNet to Foundation Models -- A history and potential futures

From arXiv: 56 pages, 6 figures, 3 tables. Extended from: Gari D. Clifford. Past, Present and Future Challenges in Sharing Science: From PhysioNet to Foundation Models. 51st Computing in Cardiology, Karlsruhe, Germany, 51:1-4, 2024.

Over the last 35 years, the sharing of medical data and models for research has evolved from sneakernet to the internet - from mailing magnetic tapes and compact discs of a handful of well-curated recordings, to the high-speed download of relatively comprehensive hospital databases. More recently, the fervor around the potential for modern machine learning and 'AI' to catapult us into the next industrial revolution has led to a seemingly insatiable desire to pump almost any source of data into large models. Although this has great potential, it also presents a whole set of new challenges. In this article I examine these trends over the last 30 years, drawing on examples from cardiology, one of the oldest data-intensive fields, which is undergoing a renaissance via machine learning. From the early days of computerized cardiology, the Research Resource for Complex Physiologic Signals (PhysioNet) has been at the cutting edge of this field. This article, therefore, includes much of the Resource's history and the contributions drawn from 25 years of firsthand experience of co-developing elements of the Resource with its founders. I identify the most promising future directions for the PhysioNet Resource, and more generally, the growing issues and opportunities around dissemination and use of massive physiological databases, associated open-access code, and public competitions, along with potential solutions to the key issues facing our field. Topics range from how we should approach foundation models in the context of the rapidly growing AI carbon footprint, to the potential of Tiny-ML and edge computing. I also cover issues around prizes and incentives, funding models, and scientific repeatability, as well as how we might address these issues by leveraging the PhysioNet Challenges, consistent with the philosophy of open access from the early days of the PhysioNet Resource.
