Generating synthetic text addresses the challenge of data availability in privacy-sensitive domains such as healthcare. This study explores the applicability of synthetic data in real-world medical settings. We introduce MedSyn, a novel medical text generation framework that integrates large language models with a Medical Knowledge Graph (MKG). We use the MKG to sample prior medical information for the prompt and generate synthetic clinical notes with GPT-4 and fine-tuned LLaMA models. We assess the benefit of synthetic data by applying it to the ICD code prediction task. Our results indicate that synthetic data can increase the classification accuracy of vital and challenging codes by up to 17.8% compared to settings without synthetic data. Furthermore, to provide new data for further research in the healthcare domain, we present the largest open-source synthetic dataset of clinical notes for the Russian language, comprising over 41k samples covering 219 ICD-10 codes.