From PhysioNet to Foundation Models -- A History and Potential Futures

From arXiv: 56 pages, 6 figures, 3 tables. Extended from: Gari D. Clifford. Past, Present and Future Challenges in Sharing Science: From PhysioNet to Foundation Models. 51st Computing in Cardiology, Karlsruhe, Germany, 51:1-4, 2024.

Over the last 35 years, the sharing of medical data and models for research has evolved from sneakernet to the internet - from mailing magnetic tapes and compact discs of a handful of well-curated recordings, to the high-speed download of relatively comprehensive hospital databases. More recently, the fervor around the potential for modern machine learning and 'AI' to catapult us into the next industrial revolution has led to a seemingly insatiable desire to pump almost any source of data into large models. Although this has great potential, it also presents a whole set of new challenges. In this article I examine these trends over the last 30 years, drawing on examples from cardiology, one of the oldest data-intensive fields that is undergoing a renaissance via machine learning. From the early days of computerized cardiology, the Research Resource for Complex Physiologic Signals (PhysioNet) has been at the cutting edge of this field. This article, therefore, includes much of the Resource's history and the contributions drawn from 25 years of firsthand experience of co-developing elements of the Resource with its founders. I identify the most promising future directions for the PhysioNet Resource, and more generally, the growing issues and opportunities around dissemination and use of massive physiological databases, associated open access code, and public competitions, along with potential solutions to the key issues facing our field. Topics range from how we should approach foundation models in the context of the rapidly growing AI carbon footprint, to the potential of Tiny-ML and edge computing. I also cover issues around prizes and incentives, funding models, and scientific repeatability, as well as how we might address these issues by leveraging the PhysioNet Challenges, consistent with the philosophy of open-access from the early days of the PhysioNet Resource.
