Advances in healthcare have shifted the focus toward patient-centric approaches, particularly self-care and patient education, facilitated by access to Electronic Health Records (EHRs). However, medical jargon in EHRs poses significant challenges to patient comprehension. To address this, we introduce a new task of automatically generating lay definitions, aiming to simplify complex medical terms into patient-friendly lay language. We first created the README dataset, an extensive collection of over 50,000 unique (medical term, lay definition) pairs and 300,000 mentions, each offering context-aware lay definitions manually annotated by domain experts. We also engineered a data-centric Human-AI pipeline that synergizes data filtering, augmentation, and selection to improve data quality. We then used README as training data and leveraged a Retrieval-Augmented Generation method to reduce hallucinations and improve the quality of model outputs. Our extensive automatic and human evaluations demonstrate that open-source, mobile-friendly models, when fine-tuned on high-quality data, can match or even surpass the performance of state-of-the-art closed-source large language models such as ChatGPT. This research represents a significant stride toward closing the knowledge gap in patient education and advancing patient-centric healthcare solutions.