Reproducibility remains a significant challenge in machine learning (ML) for healthcare. In this field, datasets, model pipelines, and even task and cohort definitions are often private, which creates a substantial barrier to sharing, iterating on, and understanding ML results on electronic health record (EHR) datasets. In this paper, we address a major part of this problem by introducing the Automatic Cohort Extraction System for Event-Stream Datasets (ACES). This tool is designed to simplify the development of tasks and cohorts for ML in healthcare while also enabling the reproduction of these cohorts, both at an exact level for single datasets and at a conceptual level across datasets. To accomplish this, ACES provides (1) a highly intuitive and expressive configuration language for defining both dataset-specific concepts and dataset-agnostic inclusion/exclusion criteria, and (2) a pipeline to automatically extract patient records that meet these defined criteria from real-world data. ACES can be automatically applied to any dataset in either the Medical Event Data Standard (MEDS) or EventStreamGPT (ESGPT) formats, or to *any* dataset for which the necessary task-specific predicates can be extracted in an event-stream form. ACES has the potential to substantially lower the barrier to entry for defining ML tasks, redefine the way researchers interact with EHR datasets, and markedly improve the state of reproducibility for ML studies in this modality. ACES is available at https://github.com/justin13601/aces.
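To make the separation between dataset-specific concepts and dataset-agnostic criteria concrete, the following is a minimal plain-Python sketch. It does not use the ACES configuration language or API; the `Event` record, the `PREDICATES` mapping, the raw codes (`ADMIT`, `DEATH`, `DISCH`), the window lengths, and the `extract_cohort` helper are all hypothetical names chosen for illustration. The sketch assumes a toy event stream, maps raw codes to named predicates, and then applies criteria written only in terms of those predicates.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Event:
    """One row of a toy event stream (hypothetical schema, not MEDS or ESGPT)."""
    patient_id: int
    time: datetime
    code: str


# Dataset-specific layer (assumed codes): predicate name -> raw codes.
PREDICATES = {
    "admission": {"ADMIT"},
    "death": {"DEATH"},
}


def matches(event, predicate):
    """True if the event satisfies the named predicate."""
    return event.code in PREDICATES[predicate]


def extract_cohort(events,
                   input_window=timedelta(hours=24),
                   target_window=timedelta(days=30)):
    """Dataset-agnostic criteria, expressed only through predicate names.

    For each admission (the trigger), exclude it if a death occurs within
    `input_window` of the admission; otherwise emit a row whose label says
    whether a death occurs in the following `target_window`.
    """
    by_patient = {}
    for ev in sorted(events, key=lambda e: (e.patient_id, e.time)):
        by_patient.setdefault(ev.patient_id, []).append(ev)

    rows = []
    for pid, evs in by_patient.items():
        for trigger in evs:
            if not matches(trigger, "admission"):
                continue
            input_end = trigger.time + input_window
            target_end = input_end + target_window
            # Exclusion criterion: death inside the input window.
            if any(matches(e, "death") and trigger.time <= e.time <= input_end
                   for e in evs):
                continue
            # Label: death inside the target window.
            label = any(matches(e, "death") and input_end < e.time <= target_end
                        for e in evs)
            rows.append((pid, trigger.time, label))
    return rows


T0 = datetime(2020, 1, 1)
EVENTS = [
    Event(1, T0, "ADMIT"), Event(1, T0 + timedelta(days=9), "DEATH"),
    Event(2, T0, "ADMIT"), Event(2, T0 + timedelta(hours=12), "DEATH"),
    Event(3, T0, "ADMIT"), Event(3, T0 + timedelta(days=4), "DISCH"),
]
cohort = extract_cohort(EVENTS)
```

Under these assumptions, porting the task to another dataset would only require changing the `PREDICATES` mapping, while the criteria in `extract_cohort` stay untouched. This mirrors, in spirit only, the exact and conceptual levels of reproduction described above.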