GENIE: Generative Note Information Extraction model for structuring EHR data

Electronic Health Records (EHRs) hold immense potential for advancing healthcare, offering rich, longitudinal data that combine structured information with valuable insights from unstructured clinical notes. However, the unstructured nature of clinical text poses significant challenges for secondary applications. Traditional methods for structuring EHR free-text data, such as rule-based systems and multi-stage pipelines, are often limited by time-consuming configuration and an inability to adapt to clinical notes from diverse healthcare settings. Few systems provide comprehensive attribute extraction for the clinical terms they identify. While very large language models (LLMs) such as GPT-4 and LLaMA 405B excel at structuring tasks, they are slow, costly, and impractical for large-scale use. To overcome these limitations, we introduce GENIE, a Generative Note Information Extraction system that leverages LLMs to convert unstructured clinical text into usable data in a standardized format. GENIE processes entire paragraphs in a single pass, extracting entities, assertion statuses, locations, modifiers, values, and purposes with high accuracy. Its unified, end-to-end approach simplifies workflows, reduces errors, and eliminates the need for extensive manual intervention. Using a robust data preparation pipeline and fine-tuned small-scale LLMs, GENIE achieves competitive performance across multiple information extraction tasks, outperforms traditional tools such as cTAKES and MetaMap, and can handle the extraction of additional attributes. These properties improve GENIE's real-world applicability and scalability in healthcare systems. By open-sourcing the model and test data, we aim to encourage collaboration and drive further advances in EHR structuring.
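To make the idea of a standardized output format concrete, the following is a minimal sketch of how one structured record per extracted entity might be represented and parsed. It is not GENIE's actual schema or API: the field names, the JSON layout, the `parse_model_output` helper, and the example output string are all assumptions introduced here for illustration, chosen only to mirror the attribute types listed above (entity, assertion status, location, modifier, value, purpose).

```python
import json
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class ExtractedEntity:
    """One structured record per clinical term (hypothetical schema)."""
    entity: str
    assertion_status: Optional[str] = None
    location: Optional[str] = None
    modifier: Optional[str] = None
    value: Optional[str] = None
    purpose: Optional[str] = None


def parse_model_output(raw: str) -> List[ExtractedEntity]:
    """Parse a JSON list of records, ignoring keys outside the assumed schema."""
    known = {f.name for f in fields(ExtractedEntity)}
    records = json.loads(raw)
    return [
        ExtractedEntity(**{k: v for k, v in record.items() if k in known})
        for record in records
    ]


# Hand-written illustration for the paragraph
# "Patient denies chest pain. Started metformin 500 mg for diabetes."
# This is NOT real model output.
EXAMPLE_OUTPUT = """
[
  {"entity": "chest pain", "assertion_status": "absent", "location": "chest"},
  {"entity": "metformin", "assertion_status": "present",
   "value": "500 mg", "purpose": "diabetes"}
]
"""

if __name__ == "__main__":
    for item in parse_model_output(EXAMPLE_OUTPUT):
        print(item)
```

Under these assumptions, a single generative pass over a paragraph would yield one such list, so each term and its attributes arrive together in one record.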
