The growing demand for machine learning in healthcare requires processing increasingly large electronic health record (EHR) datasets, but existing pipelines are neither computationally efficient nor scalable. In this paper, we introduce meds_reader, a Python package for EHR data processing that exploits many intrinsic properties of EHR data to improve speed. We then demonstrate the benefits of meds_reader by reimplementing key components of two major EHR processing pipelines, achieving 10-100x improvements in memory usage, speed, and disk usage. The code for meds_reader is available at https://github.com/som-shahlab/meds_reader.