We present EDEN (Emergency Department Electronic Notes), a new large-scale corpus of clinical notes produced in the Emergency Departments of Italian hospitals. In its current version, the corpus comprises approximately 4 million fully anonymized clinical notes covering diverse phases of patient care during the emergency department stay. In addition, a subset of about six thousand notes has been manually annotated by clinical experts through a structured Case Report Form (CRF) containing 132 items relevant to two patient conditions encountered in emergency departments: dyspnea and loss of consciousness. Items may take numerical values (e.g., blood oxygen saturation), categorical values (e.g., level of consciousness), binary values (e.g., presence of trauma), or mixed value types. The annotation process involved multiple clinicians and underwent iterative revision to resolve ambiguities in item formulation, resulting in a richly structured (although highly imbalanced) resource. The dataset aims to fill a gap in the data available to support both the development and the use of Large Language Models in concrete medical applications. We describe the data collection protocol, the on-site anonymization pipeline, the corpus statistics, and the annotation scheme. Finally, we propose CRF-filling as a novel structured information extraction benchmark and provide zero-shot baselines obtained with Gemma-27B and MedGemma-27B. To the best of our knowledge, EDEN is the largest freely available corpus of clinical notes for the Italian language.
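To make the CRF-filling task concrete, the following is a minimal sketch of how typed CRF items and a simple scoring function could be represented. It is not the schema or the metric used for EDEN: the class `CRFItem`, the function `item_accuracy`, the three item names, and the category labels are all hypothetical and chosen only to mirror the numerical, categorical, and binary value types mentioned above. Mixed-type items are left out for brevity.

```python
from dataclasses import dataclass
from typing import Dict, Sequence, Tuple, Union

# None stands for "not reported in the note" (an assumption of this sketch).
Value = Union[float, str, bool, None]


@dataclass(frozen=True)
class CRFItem:
    """Hypothetical description of one CRF item."""
    name: str
    kind: str  # "numerical", "categorical", or "binary"
    categories: Tuple[str, ...] = ()  # only used for categorical items

    def is_valid(self, value: Value) -> bool:
        """Check that a value has the type this item expects."""
        if value is None:
            return True
        if self.kind == "numerical":
            # bool is a subclass of int in Python, so exclude it explicitly.
            return isinstance(value, (int, float)) and not isinstance(value, bool)
        if self.kind == "categorical":
            return value in self.categories
        if self.kind == "binary":
            return isinstance(value, bool)
        return False


def item_accuracy(
    items: Sequence[CRFItem],
    predicted: Dict[str, Value],
    gold: Dict[str, Value],
) -> float:
    """Fraction of items whose predicted value is valid and equals the gold value."""
    correct = sum(
        1
        for item in items
        if item.is_valid(predicted.get(item.name))
        and predicted.get(item.name) == gold.get(item.name)
    )
    return correct / len(items)


# Illustrative items only; the names and categories are invented for this sketch.
ITEMS = [
    CRFItem("oxygen_saturation", "numerical"),
    CRFItem("consciousness_level", "categorical", ("alert", "confused", "unresponsive")),
    CRFItem("trauma_present", "binary"),
]

gold = {"oxygen_saturation": 94.0, "consciousness_level": "alert", "trauma_present": False}
predicted = {"oxygen_saturation": 94.0, "consciousness_level": "drowsy", "trauma_present": False}
score = item_accuracy(ITEMS, predicted, gold)
```

In this example the categorical prediction is outside the allowed labels, so it counts as an error. Because the annotated items are highly imbalanced, plain accuracy of this kind may be dominated by the majority value of each item; a per-item or macro-averaged metric would likely be more informative in practice.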