Advances in healthcare have shifted the focus toward patient-centric approaches, particularly self-care and patient education, facilitated by access to Electronic Health Records (EHRs). However, medical jargon in EHRs poses significant challenges to patient comprehension. To address this, we introduce a new task of automatically generating lay definitions, aiming to simplify complex medical terms into patient-friendly lay language. We first created the README dataset, an extensive collection of over 50,000 unique (medical term, lay definition) pairs and 300,000 mentions, each offering context-aware lay definitions manually annotated by domain experts. We also engineered a data-centric Human-AI pipeline that synergizes data filtering, augmentation, and selection to improve data quality. We then used README as training data and leveraged a Retrieval-Augmented Generation method to reduce hallucinations and improve the quality of model outputs. Our extensive automatic and human evaluations demonstrate that open-source, mobile-friendly models, when fine-tuned on high-quality data, can match or even surpass the performance of state-of-the-art closed-source large language models such as ChatGPT. This research represents a significant stride toward closing the knowledge gap in patient education and advancing patient-centric healthcare solutions.