Clinical natural language processing (NLP) requires methods that can address domain-specific challenges, such as complex medical terminology and clinical contexts. Recently, large language models (LLMs) have shown promise in this domain. Yet deploying them directly can raise privacy concerns and is constrained by resource requirements. To address this challenge, we study synthetic clinical text generation using LLMs for clinical NLP tasks. We propose ClinGen, a resource-efficient approach that infuses domain knowledge into the generation process. ClinGen combines clinical knowledge extraction with context-informed LLM prompting: both clinical topics and writing styles are drawn from external domain-specific knowledge graphs and LLMs to guide data generation. Our extensive empirical study across 7 clinical NLP tasks and 16 datasets shows that ClinGen consistently improves performance across tasks, brings the generated data closer to the distribution of real datasets, and significantly enriches the diversity of generated training instances. We will publish our code and all the generated data at \url{https://github.com/ritaranx/ClinGen}.
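As a rough illustration of the context-informed prompting idea described in the abstract, the following minimal sketch pairs a sampled clinical topic with a sampled writing style to build a generation prompt. It is not the ClinGen implementation: the names `TOPICS`, `STYLES`, `build_prompt`, and `sample_prompts`, the prompt wording, and the example topics and styles are all assumptions made for illustration. In the described approach the topics would come from a knowledge graph or an LLM and the styles from an LLM, and the resulting prompts would be sent to an LLM to produce synthetic training text.

```python
import random

# Hypothetical inputs: placeholders standing in for topics extracted from a
# clinical knowledge graph or an LLM, and writing styles suggested by an LLM.
# These values are illustrative and are not taken from the paper.
TOPICS = ["hypertension", "type 2 diabetes", "asthma"]
STYLES = ["discharge summary", "research abstract", "patient forum post"]


def build_prompt(task: str, label: str, topic: str, style: str) -> str:
    """Compose one generation prompt conditioned on a topic and a writing style."""
    return (
        f"Write a {style} for the task '{task}' with label '{label}'. "
        f"The text should discuss {topic}."
    )


def sample_prompts(task: str, label: str, n: int, seed: int = 0) -> list[str]:
    """Sample n (topic, style) pairs and turn each pair into a prompt."""
    rng = random.Random(seed)  # seeded so the sampled prompts are reproducible
    return [
        build_prompt(task, label, rng.choice(TOPICS), rng.choice(STYLES))
        for _ in range(n)
    ]


if __name__ == "__main__":
    for prompt in sample_prompts("disease recognition", "positive", n=3):
        print(prompt)
```

Varying the topic and style per prompt is one simple way to push generated instances toward greater diversity; how closely the output matches real data would still depend on the quality of the extracted knowledge and on the LLM used.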